Abstract

This paper extends classical statistical results, namely the Fisher Theorem
and the Wilks phenomenon, to penalized maximum likelihood estimation
with a quadratic penalty. Spokoiny (2013a) offered a novel approach which
allows one to study the properties of quasi maximum likelihood estimators
for finite samples and under possible model misspecification. The results
from Spokoiny (2013a) also apply for a growing parameter dimension $p$,
however under the constraint "$p^3/n$ is small", where $n$ is the sample
size. This paper shows that in the case of penalized maximum likelihood
estimation, the results can be applied to an arbitrarily large or even
infinite dimension $p$ of the parameter space. The error bounds depend on
the so-called effective dimension, which can be much smaller than the true
dimension $p$ of the parameter space.
In particular, we show for the i.i.d. case that the results apply under the
condition "$p_s^3/n$ is small", where $p_s$ denotes the effective dimension.
1 Introduction

The Fisher and Wilks Theorems belong to the short list of the most fascinating results in
statistical theory. In particular, the Wilks result in its simple form claims that the
likelihood ratio test statistic is close in distribution to the $\chi^2_p$ distribution as the
sample size increases. So, the limiting distribution of this test statistic only depends on the
dimension of the parameter space whatever the parametric model is. This explains why
this result is sometimes called the Wilks phenomenon. This paper aims at reconsidering
the mentioned results from different viewpoints. Probably the most important issue is
that the presented results are stated for finite samples. There are only few general
finite-sample results in statistical inference; see Boucheron and Massart (2011) and references
therein in the context of i.i.d. modeling. The novel approach from Spokoiny (2013a) provides
a general framework for a finite sample theory, and the present paper illustrates the power
of this general methodology: the classical large sample results can be extended to the finite
sample case with explicit and sharp error bounds. Moreover, all classical asymptotic
parametric results are easily derivable from the obtained finite sample statements.

Another important issue is a possible model misspecification. The classical parametric 
theory is essentially parametric in the sense that it requires the parametric assumption to 
be exactly fulfilled. Any violation of the parametric specification may destroy the Fisher 
and Wilks results. This study admits from the very beginning that the parametric 
specification is probably wrong. This automatically extends the applicability of the 
proposed approach. 

The final issue is the reduction of the model complexity and model dimension by
using a penalization. In this paper we focus on quadratic-type penalization. The roughness
penalty approach provides a popular example; cf. Green and Silverman (1994). Tikhonov
regularization and ridge regression are other examples which are often used in linear
inverse problems. It is well known that the use of a penalization in the context of an inverse
problem provides regularization and uncertainty reduction at the same time.
In general, model selection based on a proper choice of penalization is a major topic, one
of the central ones in nonparametric statistics. We refer to Shen (1997), Birgé and Massart
(1998) for general models and to Birgé and Massart (2001, 2007) for Gaussian model
selection, where one can find an extensive overview of the vast literature on this problem.

First we specify our set-up following Spokoiny (2013a). Let $Y$ denote the observed
data and $\mathbb{P}$ their distribution. A general parametric assumption (PA) means that
$\mathbb{P}$ belongs to a $p$-dimensional family $(\mathbb{P}_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$
dominated by a measure $\mu_0$. This family yields the log-likelihood function
$L(\theta) = L(Y,\theta) \stackrel{\mathrm{def}}{=} \log \frac{d\mathbb{P}_\theta}{d\mu_0}(Y)$. The PA
can be misspecified, so, in general, $L(\theta)$ is a quasi log-likelihood. The classical likelihood
principle suggests to estimate $\theta$ by maximizing the function $L(\theta)$:

$$ \tilde\theta \stackrel{\mathrm{def}}{=} \operatorname{argmax}_{\theta \in \Theta} L(\theta). \tag{1.1} $$

If $\mathbb{P} \notin (\mathbb{P}_\theta)$, then the (quasi) MLE $\tilde\theta$ from (1.1) is still meaningful and it
appears to be an estimate of the value $\theta^*$ defined by maximizing the expected value of
$L(\theta)$:

$$ \theta^* \stackrel{\mathrm{def}}{=} \operatorname{argmax}_{\theta \in \Theta} \mathbb{E} L(\theta), $$

which is the true value in the parametric situation and can be viewed as the parameter
of the best parametric fit in the general case.

It is well known that the main properties of the quasi MLE $\tilde\theta$ like concentration or
coverage probability can be described in terms of the excess or quasi maximum likelihood
$L(\tilde\theta,\theta^*) \stackrel{\mathrm{def}}{=} L(\tilde\theta) - L(\theta^*) = \max_{\theta\in\Theta} L(\theta) - L(\theta^*)$, which is the difference between the
maximum of the process $L(\theta)$ and its value at the "true" point $\theta^*$; cf. Spokoiny (2013a).
The classical Fisher Theorem claims the expansion for the MLE $\tilde\theta$:

$$ D_0(\tilde\theta - \theta^*) - \xi \stackrel{\mathbb{P}}{\longrightarrow} 0, $$

where $D_0^2 = -\nabla^2 \mathbb{E} L(\theta^*)$ and $\xi \stackrel{\mathrm{def}}{=} D_0^{-1} \nabla L(\theta^*)$. Under the correct model specification,
$D_0^2$ is the total Fisher information matrix and the vector $\xi$ is centered and standardized.
So, it is asymptotically standard normal under general CLT conditions.

The Wilks phenomenon claims that the distribution of twice the excess $2L(\tilde\theta,\theta^*)$
can be approximated by $\|\xi\|^2$ which is asymptotically $\chi^2_p$, where $p$ is the dimension of
the parameter space:

$$ 2L(\tilde\theta,\theta^*) - \|\xi\|^2 \stackrel{\mathbb{P}}{\longrightarrow} 0, \qquad \|\xi\|^2 \stackrel{w}{\longrightarrow} \chi^2_p. $$

This fact is very attractive and yields asymptotic confidence and concentration sets as
well as the limiting critical values for the likelihood ratio tests.
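
The Wilks phenomenon can be checked numerically in the simplest setting. The sketch below is an illustration only and is not taken from the paper: it assumes a correctly specified Gaussian shift model $Y_i \sim N(\theta^*, I_p)$, in which twice the excess equals $n\|\bar Y - \theta^*\|^2$ and is exactly $\chi^2_p$ distributed. The function names, sample sizes and seed are hypothetical choices.

```python
import random


def twice_excess(sample, theta_star):
    """Return 2 * (L(theta_tilde) - L(theta_star)) for the Gaussian shift model
    Y_i ~ N(theta, I_p) with L(theta) = -0.5 * sum_i ||Y_i - theta||^2."""
    n, p = len(sample), len(theta_star)
    # In this model the MLE is the coordinate-wise sample mean.
    theta_tilde = [sum(y[j] for y in sample) / n for j in range(p)]

    def loglik(theta):
        return -0.5 * sum((y[j] - theta[j]) ** 2 for y in sample for j in range(p))

    return 2.0 * (loglik(theta_tilde) - loglik(theta_star))


def simulate(p=5, n=50, reps=2000, seed=0):
    """Monte Carlo mean and variance of twice the excess (illustrative sizes)."""
    rng = random.Random(seed)
    theta_star = [0.0] * p
    values = []
    for _ in range(reps):
        sample = [[rng.gauss(theta_star[j], 1.0) for j in range(p)] for _ in range(n)]
        values.append(twice_excess(sample, theta_star))
    mean = sum(values) / reps
    var = sum((v - mean) ** 2 for v in values) / (reps - 1)
    return mean, var


if __name__ == "__main__":
    mean, var = simulate()
    # A chi^2_p variable has mean p and variance 2p.
    print(f"mean = {mean:.2f}, variance = {var:.2f}")
```

The empirical mean and variance should be close to $p$ and $2p$, whatever the value of $\theta^*$; this is the sense in which the limit depends only on the dimension.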

However, practical applications of all the mentioned results are limited: they require
a true parametric distribution, large samples and a fixed parameter dimension. Modern
applications stimulate a further extension of the classical theory beyond the classical
parametric assumptions. Spokoiny (2013a) offers a general approach which appears to
be very useful for such an extension. The whole approach is based on the following local
bracketing result:

$$ \mathbb{L}_{\underline\epsilon}(\theta,\theta^*) - \diamondsuit_{\underline\epsilon} \le L(\theta) - L(\theta^*) \le \mathbb{L}_{\epsilon}(\theta,\theta^*) + \diamondsuit_{\epsilon}, \qquad \theta \in \Theta_0. \tag{1.2} $$

Here $\mathbb{L}_{\epsilon}(\theta,\theta^*)$ and $\mathbb{L}_{\underline\epsilon}(\theta,\theta^*)$ are expressions quadratic in $\theta - \theta^*$, and $\Theta_0$ is a local
vicinity of the central point $\theta^*$. This result can be viewed as an extension of the famous
Le Cam local asymptotic normality (LAN) condition. The LAN condition considers just
one quadratic process for approximating the log-likelihood $L(\theta)$. The use of bracketing
with two different quadratic expressions allows one to keep control of the error terms
$\diamondsuit_{\epsilon}, \diamondsuit_{\underline\epsilon}$ even for relatively large neighborhoods $\Theta_0$ of $\theta^*$, while the LAN approach is essentially
restricted to a root-n vicinity of $\theta^*$. It also allows one to incorporate a large parameter
dimension and a model misspecification. However, the approach from Spokoiny (2013a)
has natural limitations: the parameter dimension $p$ cannot be too large. For instance,
in the i.i.d. case, the error terms $\diamondsuit_{\epsilon}$ and $\diamondsuit_{\underline\epsilon}$ are of order $\sqrt{p^3/n}$, which destroys the
Wilks result if $p > n^{1/3}$.

A standard way of overcoming this difficulty is to impose a kind of smoothness
assumption on the unknown parameter value $\theta^*$. Here we discuss one general way to deal
with such smoothness assumptions using a quadratic penalization. This includes the
roughness penalty of Green and Silverman (1994) and Tikhonov regularization. Section 2
revisits and restates the bracketing bound (1.2) and all its corollaries for the penalized
MLE. The main novelty of the approach is the systematic use of the effective dimension
$p_s$ in place of the original dimension $p$ of the parameter space. In particular, the error
terms are small if $p_s^3/n$ is small. However, $p_s$ is usually much smaller than $p$. Even the
case of a functional parameter space with $p = \infty$ can be included.

Our main results are similar to the non-penalized case. We obtain a precise expansion
of the maximum likelihood and the maximum likelihood estimator and show that the error
terms only depend on the effective dimension $p_s$. In the i.i.d. case our results are similar
to Boucheron and Massart (2011). Some asymptotic results for generalized regression
models are available in Fan et al. (2001).

The paper is organized as follows. Section 2 states the analog of Fisher and Wilks 
results for the penalized MLE procedure. Section 3 collects the conditions and proofs 
of the main results. Section 4 presents some results from the empirical process theory 
which are used in our proofs. 



2 Quadratic penalization 

Let $\operatorname{pen}(\theta)$ be a penalty function on $\Theta$. A big value of $\operatorname{pen}(\theta)$ corresponds to a large
degree of roughness or a small amount of smoothness of $\theta$. The underlying assumption on
the model is that the true value $\theta^*$ is smooth in the sense that $\operatorname{pen}(\theta^*)$ is relatively small.
A penalized (quasi) MLE approach leads to maximizing the penalized log-likelihood:

$$ \tilde\theta = \operatorname{argmax}_{\theta\in\Theta} \{ L(\theta) - \operatorname{pen}(\theta) \}. $$

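Below we discuss an important special case of a quadratic penalty $\operatorname{pen}(\theta) = \|G\theta\|^2/2$
for a given symmetric matrix $G$. Popular examples are given by Tikhonov regularization
with $G = \varkappa I_p$ and by the roughness penalty; see Green and Silverman (1994). Denote

$$ L_G(\theta) \stackrel{\mathrm{def}}{=} L(\theta) - \|G\theta\|^2/2, $$

$$ \tilde\theta_G \stackrel{\mathrm{def}}{=} \operatorname{argmax}_{\theta\in\Theta} L_G(\theta). $$

The use of a penalty changes the target of estimation which is now defined as

$$ \theta^*_G \stackrel{\mathrm{def}}{=} \operatorname{argmax}_{\theta\in\Theta} \mathbb{E} L_G(\theta). \tag{2.1} $$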

So, introducing a penalty leads to some estimation bias: the new target $\theta^*_G$ may be
different from $\theta^*$. At the same time, similarly to linear modeling, the use of penalization
reduces the variability of the estimate $\tilde\theta_G$ and improves its concentration properties.
An interesting question is the total impact and a possible gain of using the penalized
procedure. A preliminary answer is that the penalty term $\operatorname{pen}(\theta^*) = \|G\theta^*\|^2/2$ at the
true point should not be too large relative to the squared error of estimation for the penalized
model. This rule is known under the name "bias-variance trade-off".

Another important message of this study is that the use of penalization allows one to
reduce the parameter dimension to the effective dimension, which can be viewed as the
entropy of the penalized parameter space. The resulting confidence and concentration
sets depend on the effective dimension rather than on the real parameter dimension and
they can be much narrower than in the non-penalized case.

The principal steps of the study are similar to Spokoiny (2013a). In a local vicinity of
the point $\theta^*_G$ we apply the bracketing device which allows one to sandwich the log-likelihood
process between two random processes quadratic in $\theta - \theta^*_G$. In the complement of this
local neighborhood we apply the upper function method which bounds the log-likelihood
$L(\theta)$ from above by a deterministic function. The main difference with the non-penalized
case is that the squared radius of this local vicinity can now be taken proportional to the
effective dimension, while the non-penalized results require a much larger neighborhood
depending on the total parameter dimension $p$. Similarly to Spokoiny (2013a), the
obtained results are stated for finite samples, but they are quite sharp and yield the
usual asymptotic statements including asymptotic normality of the penalized MLE. In
particular, we present extensions of the Fisher and Wilks results to the penalized case.
In what follows, by $C$ we denote a generic fixed constant. We also suppose that a
sufficiently large constant $x$ is fixed which specifies random events $\Omega(x)$ of dominating
probability. We say that a generic random set $\Omega(x)$ is of dominating probability if

$$ \mathbb{P}(\Omega(x)) \ge 1 - C e^{-x}. $$

It appears that the results are slightly different in two zones separated by the value $x_c$.
For $x \le x_c$, we obtain the same type of bounds as in the Gaussian case; for $x > x_c$
they are a bit worse. The value $x_c$ is usually large (of order $\sqrt{n}$), where $n$ means the
sample size; see Section 2.6 for more details. To keep the results shorter we only show the
case $x \le x_c$. The proofs in Sections 3 and 4 contain the full version.

2.1 Effective dimension 

Let $V_0^2 = \operatorname{Var}\{\nabla L(\theta^*_G)\}$; see the condition $(ED_0G)$ in Section 3.1. Let also $G$ be
a penalizing matrix. We define three quantities $p_e$, $p_G$, and $p_s$, each of which can
serve as the definition of effective dimension. In typical situations these quantities are
close to each other. The formal definition uses the spectral decomposition of the matrix
$S^2 = V_0^{-1} G^2 V_0^{-1}$. Let $s_1 \le s_2 \le \ldots \le s_p$ be the eigenvalues of $S$. The quantity $p_e(S)$
is defined as the largest index $j$ for which $s_j \le 1$:

$$ p_e(S) = \max\{ j : s_j \le 1 \}. \tag{2.2} $$

The value $p_e$ can be viewed as the dimension of the subspace on which $V_0 \ge G$. Further,
$p_G = p_G(S)$ is the trace of $(G^2 + V_0^2)^{-1} V_0^2$:

$$ p_G(S) = \sum_{j=1}^{p} \frac{1}{1 + s_j^2}. \tag{2.3} $$

Finally, define $p_s = p_s(S)$ which appears in the bound for the local entropy $Q$; see
Theorem 2.2 below:

$$ p_s(S) \stackrel{\mathrm{def}}{=} p_e(S) + \sum_{j=p_e+1}^{p} s_j^{-1} \log(s_j). \tag{2.4} $$

It is obvious that

$$ p_e \le p_G \le p_s. $$

However, if the eigenvalues $s_j$ grow sufficiently fast to ensure that $\sum_{j > p_e} s_j^{-1} \log(s_j)$ is
bounded by a fixed constant, then the quantities $p_e(S)$ and $p_s(S)$ are of the same
order; see the examples in Section 2.5.
Below we show that the use of penalization enables us to replace the original dimension
$p$ in our risk bounds with the effective dimension $p_s$, which can be much smaller than
$p$ depending on the matrices $V_0$ and $G$.
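
As an illustration of (2.2) and (2.3), the following sketch computes $p_e$ and $p_G$ from a list of eigenvalues of $S$. It is a minimal example under stated assumptions, not the author's code: the function name is hypothetical, the inequality in (2.2) is read as non-strict, and the eigenvalues are supplied directly rather than computed from $V_0$ and $G$.

```python
def effective_dimensions(s):
    """Return (p_e, p_G) for the eigenvalues s of S.

    p_e counts the eigenvalues with s_j <= 1, which equals max{j : s_j <= 1}
    once the eigenvalues are sorted; p_G = sum_j 1 / (1 + s_j^2)."""
    s = sorted(s)
    p_e = sum(1 for sj in s if sj <= 1.0)
    p_G = sum(1.0 / (1.0 + sj ** 2) for sj in s)
    return p_e, p_G


if __name__ == "__main__":
    # Three unpenalized directions and one hundred strongly penalized ones.
    s = [0.0] * 3 + [10.0] * 100
    p_e, p_G = effective_dimensions(s)
    print(p_e, p_G)   # p_e is 3 and p_G is 3 + 100/101, although p = 103
```

In this toy spectrum the total dimension is 103, while both quantities stay close to the number of unpenalized directions.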

2.2 Local bracketing 

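This section extends the local bracketing construction of Spokoiny (2013a) to the
penalized situation. The idea is to sandwich the penalized log-likelihood process
$L_G(\theta,\theta^*_G) = L_G(\theta) - L_G(\theta^*_G)$ between two processes $\mathbb{L}_{\epsilon,G}(\theta,\theta^*_G)$ and $\mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G)$,
quadratic in $\theta - \theta^*_G$, in a vicinity of the point $\theta^*_G$ from (2.1). The whole construction
involves two $p \times p$ matrices $D_0^2 \stackrel{\mathrm{def}}{=} -\nabla^2 \mathbb{E} L(\theta^*_G)$ and $V_0^2 = \operatorname{Var}\{\nabla L(\theta^*_G)\}$ as well as
two non-negative constants $\epsilon = (\delta, \varrho)$. Define $D_G^2 = D_0^2 + G^2$,

$$ \mathbb{L}_{\epsilon,G}(\theta,\theta^*_G) \stackrel{\mathrm{def}}{=} (\theta - \theta^*_G)^\top \nabla L_G(\theta^*_G) - \|D_{\epsilon,G}(\theta - \theta^*_G)\|^2/2, $$

$$ D_{\epsilon,G}^2 \stackrel{\mathrm{def}}{=} (1 - \delta) D_0^2 - \varrho V_0^2 + G^2 = D_G^2 - \delta D_0^2 - \varrho V_0^2, $$

and similarly for $\underline\epsilon = -\epsilon = (-\delta, -\varrho)$. This definition implicitly assumes that the
constants $\varrho, \delta$ are selected to ensure $D_{\epsilon,G}^2 > 0$.

Our first result is restricted to the local set $\Theta_{0,G}(r)$ defined as

$$ \Theta_{0,G}(r) = \{ \theta : \|V_0(\theta - \theta^*_G)\| \le r, \ \|G\theta\| \le r \}. $$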

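Theorem 2.1. Assume $(ED_0G)$, $(ED_1G)$, and $(\mathcal{L}_0G)$ on $\Theta_{0,G}(r)$. Let $\epsilon = (\delta, \varrho)$
with $\delta \ge \delta(r)$, $\varrho \ge 3\nu_0\,\omega(r)$, and $D_{\epsilon,G}^2 > 0$. Then for all $\theta \in \Theta_{0,G}(r)$

$$ \mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G) - \diamondsuit_{\underline\epsilon,G}(r) \le L_G(\theta,\theta^*_G) \le \mathbb{L}_{\epsilon,G}(\theta,\theta^*_G) + \diamondsuit_{\epsilon,G}(r). $$

The error term $\diamondsuit_{\epsilon,G}(r)$ obeys on a random set $\Omega(x)$ of dominating probability

$$ \diamondsuit_{\epsilon,G}(r) \le \varrho\, \mathfrak{z}_0(x,Q), \tag{2.5} $$

where $\mathfrak{z}_0(x,Q)$ is given under the condition $g^2 \ge 2(x + Q)$ by

$$ \mathfrak{z}_0(x,Q) \stackrel{\mathrm{def}}{=} \bigl( 1 + \sqrt{x + Q} \bigr)^2 \approx x + Q $$

with $Q = C p_s$ for $p_s = p_s(S)$ from (2.4), and $S^2 = V_0^{-1} G^2 V_0^{-1}$. A similar bound holds
for $\diamondsuit_{\underline\epsilon,G}(r)$ in place of $\diamondsuit_{\epsilon,G}(r)$.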

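Proof. Let $\zeta(\theta) = L(\theta) - \mathbb{E} L(\theta)$ be the stochastic component of $L(\theta)$ and
$\zeta(\theta,\theta^*_G) = \zeta(\theta) - \zeta(\theta^*_G)$. Consider the quantity

$$ \diamondsuit_{\epsilon,G}(r) \stackrel{\mathrm{def}}{=} \sup_{\theta \in \Theta_{0,G}(r)} \Bigl\{ \zeta(\theta,\theta^*_G) - (\theta - \theta^*_G)^\top \nabla\zeta(\theta^*_G) - \frac{\varrho}{2} \|V_0(\theta - \theta^*_G)\|^2 \Bigr\}, $$

and similarly for $\diamondsuit_{\underline\epsilon,G}(r)$ with $\underline\epsilon = -\epsilon$. Now Theorem 4.11 applied to

$$ \mathcal{U}(\theta,\theta^*_G) = \frac{1}{\omega(r)} \bigl\{ \zeta(\theta,\theta^*_G) - (\theta - \theta^*_G)^\top \nabla\zeta(\theta^*_G) \bigr\} $$

with $H_0 = V_0$ and the penalizing matrix $G$ yields (2.5). □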

This bracketing result essentially repeats the one for the non-penalized case from
Spokoiny (2013a), Theorem 3.1. The only difference is that the entropy value $Q$ is
now proportional to the quantity $p_s$ from (2.4) and not to the real dimensionality $p$.
Depending on the structure of the penalizing matrix $G$, the gain can be enormous.

2.3 Concentration and a large deviation bound 

This section demonstrates that the use of the penalty term helps to strengthen the
concentration properties of the penalized qMLE $\tilde\theta_G$. Namely, we show that $\tilde\theta_G$
belongs with a dominating probability to a set $\Theta_{0,G}(r_0)$ which can be much smaller than
a similar set from the non-penalized case. The main result claims that the probability
$\mathbb{P}(\tilde\theta_G \notin \Theta_{0,G}(r_0))$ is small if $r_0^2$ significantly exceeds $C p_s$ for a fixed constant
$C$. Further we present an upper function $u_G(\theta)$ ensuring the uniform upper bound
$L_G(\theta,\theta^*_G) \le -u_G(\theta)$ for all $\theta \notin \Theta_{0,G}(r_0)$ with probability at least $1 - 2e^{-x}$.

The key step in this large deviation bound is made in terms of an upper function for
the process $L_G(\theta,\theta^*_G)$. Namely, $u_G(\theta)$ is a deterministic upper function on $\Theta_1 \subseteq \Theta$ if
it holds on a set $\Omega(x)$ of dominating probability

$$ L_G(\theta,\theta^*_G) \le -u_G(\theta), \qquad \theta \in \Theta_1. $$

Theorem 2.2. Suppose $(EG)$ and $(\mathcal{L}G)$ with $b(r) \equiv b > 0$. Let for $r \ge r_0$, the
following conditions be fulfilled:

$$ 1 + \sqrt{x + Q} \le 3\nu_0^2\, g(r)/b, \qquad 6\nu_0 \sqrt{x + Q} \le r_0 b. \tag{2.6} $$

Then $u_G(\theta) = 0$ is an upper function on $\Theta \setminus \Theta_{0,G}(r_0)$ and it holds

$$ \mathbb{P}\bigl( \tilde\theta_G \notin \Theta_{0,G}(r_0) \bigr) \le 2e^{-x}. $$

Proof. The result is a special case of Theorem 4.11 in Section 4. □

The condition (2.6) requires that $r_0$ fulfills $r_0^2 \ge CQ \ge C p_s$. In the non-penalized
case of Spokoiny (2013a), a similar condition reads as $r_0^2 \ge Cp$, so the use of penalization
helps to improve the concentration properties of the penalized MLE.
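2.4 Wilks and Fisher theorems

The next result collects the corollaries of the exponential bound of Theorems 2.1 and 2.2.
Define a random $p$-vector

$$ \xi_G \stackrel{\mathrm{def}}{=} D_G^{-1} \nabla\zeta(\theta^*_G), $$

where $\zeta(\theta) = L(\theta) - \mathbb{E} L(\theta)$ is the stochastic component of the log-likelihood.

Theorem 2.3. Assume $(ED_0G)$, $(ED_1G)$, and $(\mathcal{L}_0G)$ for $r_0^2 \ge C p_s$. Also assume
the identifiability condition $(\mathcal{I}G)$ $V_0^2 \le a^2 D_G^2$ with a constant $a > 0$. Let $\delta \ge \delta(r_0)$,
$\varrho \ge 3\nu_0\,\omega(r_0)$, and $\tau_\epsilon \stackrel{\mathrm{def}}{=} \delta + a^2\varrho < 1/2$. Then it holds on a random set $\Omega(x)$ of
dominating probability

$$ \bigl| 2L_G(\tilde\theta_G,\theta^*_G) - \|\xi_G\|^2 \bigr| \le C \tau_\epsilon\, p_s, \tag{2.7} $$

$$ \bigl\| D_G(\tilde\theta_G - \theta^*_G) - \xi_G \bigr\|^2 \le C \tau_\epsilon\, p_s. \tag{2.8} $$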

The classical Fisher and Wilks results include some statements about the limiting
behavior of the vector $\xi_G$ and of the quadratic form $\|\xi_G\|^2$. In the i.i.d. case, one
can easily show that the vector $\xi_G$ is nearly standard normal; see Section 2.6 below.
However, it is well known that the convergence of $\|\xi_G\|^2$ to its limiting distribution is quite
slow even in the case of a fixed dimension $p$. For finite sample inference, we recommend
combining the approximation (2.7) with any resampling technique which mimics the
specific behavior of the quadratic form $\|\xi_G\|^2$; see Spokoiny et al. (2013) for applications
to quantile regression.

2.5 Examples of computing the effective dimension 

This section presents a couple of typical examples of using the quadratic penalty:
blockwise penalization and estimation under a Sobolev smoothness constraint. For simplicity
of presentation we assume that $V_0 = \sigma I_p$, while $G$ is diagonal with non-decreasing
eigenvalues $g_j$. It holds $S^2 = \sigma^{-2} G^2$, and $s_j = \sigma^{-1} g_j$. The formulas (2.2), (2.3), and
(2.4) apply here for computing the effective dimension.

2.5.1 Block penalization 

Consider the case when $G$ has a block structure: $G = \operatorname{diag}\{G_0, G_1\}$. The first block of
dimension $p_0$ corresponds to the unconstrained part of the parameter vector while the
second block of dimension $p_1$ corresponds to the low energy component. An interesting
question is the minimal penalization $G_1$ making the impact of the low energy part
inessential. Assume for simplicity that $G_0 = 0$ and $G_1 = g I_{p_1}$. Then
one immediately obtains $p_e = p_0$ provided $g > \sigma$. Further,
$p_G = p_0 + p_1/(1 + g^2/\sigma^2) \le p_0 + \sigma^2 g^{-2} p_1$ and
$p_s = p_0 + p_1 \sigma g^{-1} \log(g/\sigma)$. One can see that the impact of the second block $G_1$ on the
effective dimension is inessential if $g/\sigma \gg p_1/p_0$.

2.5.2 A Sobolev smoothness constraint 

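Consider the case with $V_0^2 = \sigma^2 I_p$ and $G = \operatorname{diag}\{g_1, \ldots, g_p\}$ with $g_j = L j^\beta$ for $\beta > 1$.
The value $\beta$ is usually considered as the Sobolev smoothness parameter. The value $p_e$
is the largest index $j$ satisfying $L j^\beta \le \sigma$. Further, by (2.4) with $s_j = L j^\beta/\sigma$,

$$ p_s = p_e + \sum_{j=p_e+1}^{p} s_j^{-1} \log(s_j). $$

It is straightforward to see that $\beta > 1$ yields $p_s \le p_e + C(\beta)$ for some fixed constant
$C(\beta)$ depending on $\beta$ only.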
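
To see how little the full dimension matters under a Sobolev-type penalty, one can evaluate (2.2) and (2.3) for growing $p$. The sketch below is an illustration with hypothetical names and parameter values ($\sigma = 100$, $L = 1$, $\beta = 2$); it uses only $p_e$ and $p_G$, reads the inequality in (2.2) as non-strict, and assumes $s_j = L j^\beta/\sigma$ as in this section.

```python
def sobolev_effective_dimensions(p, sigma, L, beta):
    """Return (p_e, p_G) for s_j = L * j**beta / sigma, j = 1..p."""
    s = [L * j ** beta / sigma for j in range(1, p + 1)]
    p_e = sum(1 for sj in s if sj <= 1.0)          # largest j with L * j**beta <= sigma
    p_G = sum(1.0 / (1.0 + sj * sj) for sj in s)   # trace formula (2.3)
    return p_e, p_G


if __name__ == "__main__":
    for p in (100, 1000, 10000):
        print(p, sobolev_effective_dimensions(p, sigma=100.0, L=1, beta=2))
```

The printed pairs barely change as $p$ grows by two orders of magnitude, which is the numerical counterpart of the claim that even $p = \infty$ can be handled.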

2.6 I.i.d. case and further examples 

We start this section with a general remark about the applicability of the obtained results
in particular examples and models. It appears that the conditions for the penalized
MLE are weaker than the corresponding conditions in the non-penalized case. Spokoiny
(2013a) already checked the conditions for the i.i.d. case, GLM and quantile regression. So,
the previous results apply immediately under the conditions of Section 5 of Spokoiny
(2013a) to the penalized MLE. Spokoiny et al. (2013) illustrates how the Wilks and
Fisher results of Theorem 2.3 can be used in the context of local quantile regression.

Below we briefly discuss the case of an i.i.d. model with $n$ observations. We suppose
that for each $n$ the parameter set $\Theta$ as well as the parameter dimension $p = p_n$ may
depend on $n$. The penalty term $\operatorname{pen}(\theta)$ is still of the form $\operatorname{pen}(\theta) = \|G\theta\|^2/2$, and the
log-likelihood $L(\theta)$ is of the form

$$ L(\theta) = \sum_{i=1}^{n} \ell(Y_i,\theta), $$

where $\ell(y,\theta)$ is the log-density of one observation. It holds

$$ D_0^2 = n\, \mathbb{F}(\theta^*_G), \qquad V_0^2 = n\, v_0^2(\theta^*_G), $$

where $\mathbb{F}(\theta) = -\nabla^2 \mathbb{E}\, \ell(Y_1,\theta)$, $v_0^2(\theta) = \operatorname{Var}\{\nabla \ell(Y_1,\theta)\}$. The value $p_s$ is defined as
previously by (2.4). Note that all the introduced quantities may depend on $n$ and $p_n$.
The main goal is to show that the presented general approach yields sharp results in
this special case. We also show that the rule "$p^3/n$ small" extends to the penalized
case with $p$ replaced by $p_s$. Suppose that the conditions of Section 5.1 from Spokoiny
(2013a) are fulfilled. Then one can check the conditions of this paper from Section 3.1
with $\delta(r) = Cr/\sqrt{n}$ and $\varrho(r) = Cr/\sqrt{n}$ yielding $\tau_\epsilon = C r_0/\sqrt{n}$. Moreover, due to
Theorem 2.2, the local radius $r_0$ can be taken as $r_0^2 = C p_s$. Now the condition "$\tau_\epsilon p_s$ is
small" translates into "$p_s^3/n$ small".
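
The last translation is a one-line computation. Assuming $\tau_\epsilon = C r_0/\sqrt{n}$ and $r_0^2 = C p_s$ as stated above, and writing $C'$ for a new generic constant (a symbol introduced here only for this illustration):

```latex
\tau_\epsilon \, p_s
  = \frac{C\, r_0}{\sqrt{n}} \, p_s
  = \frac{C \sqrt{C\, p_s}}{\sqrt{n}} \, p_s
  = C' \sqrt{\frac{p_s^{3}}{n}},
\qquad C' = C^{3/2}.
```

So the error bound $C\tau_\epsilon p_s$ of Theorem 2.3 is of order $\sqrt{p_s^3/n}$ in the i.i.d. case.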



Theorem 2.4. Suppose that $\beta_n = \sqrt{p_s^3/n} \to 0$ as $n \to \infty$. Suppose also that the
conditions of Theorem 5.1 from Spokoiny (2013a) are fulfilled for each $n$. Then

$$ \bigl| 2L_G(\tilde\theta_G,\theta^*_G) - \|\xi_G\|^2 \bigr| \le C\beta_n, $$

$$ \bigl\| D_G(\tilde\theta_G - \theta^*_G) - \xi_G \bigr\|^2 \le C\beta_n. $$

As already mentioned, both expansions follow from the general statements of
Theorem 2.3. In addition, one can use the central limit theorem to state a kind of asymptotic
normality for the random vector $\xi_G$ of growing dimension. We refer to Andresen and
Spokoiny (2013) for a version of such a result in the context of semiparametric profile
estimation.



3 Proofs 

This section presents the conditions and the proofs of the main results.

3.1 Conditions

This section presents the list of conditions, which is essentially the same as in the
non-penalized case from Spokoiny (2013a). However, the use of penalization leads to some
change in each condition. The most important fact is the reduction of the entropy value
$Q$ caused by penalization. Note first that the stochastic components of the original
process $L(\theta)$ and the penalized process $L_G(\theta) = L(\theta) - \|G\theta\|^2/2$ coincide. However, the
local conditions $(ED_0)$ and $(ED_1)$ on the stochastic component of the process $L(\theta)$ have
to be restated for two reasons. First of all, the central point $\theta^*$ is changed to $\theta^*_G$. Second,
the use of penalization allows one to reduce the size of the local neighborhood in which the
quadratic bracketing of the process $L(\theta)$ has to be applied. Let $V_0^2 = \operatorname{Var}\{\nabla L(\theta^*_G)\}$.
In the non-penalized case of Spokoiny (2013a) the local conditions are assumed to be
fulfilled on the local set $\Theta_0(r) = \{\theta : \|V_0(\theta - \theta^*)\| \le r\}$. Now we restrict this set by
intersecting it with the set $\{\theta : \|G\theta\| \le r\}$:

$$ \Theta_{0,G}(r) = \{ \theta : \|V_0(\theta - \theta^*_G)\| \le r, \ \|G\theta\| \le r \}. $$

Similarly to the non-penalized case we distinguish between local and global conditions.
The global conditions concern the global behavior of the process $L(\theta)$ while the local
conditions focus on its behavior in the vicinity of the central point $\theta^*_G$. Below we suppose
that the degree of locality is described by a number $r$. The local zone corresponds to $r \le r_0$
for a fixed $r_0$.

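$(ED_0G)$ There exist a positive symmetric matrix $V_0^2$, and constants $g > 0$, $\nu_0 \ge 1$
such that $\operatorname{Var}\{\nabla\zeta(\theta^*_G)\} \le V_0^2$ and

$$ \sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \nabla\zeta(\theta^*_G)}{\|V_0\gamma\|} \Bigr\} \le \nu_0^2\lambda^2/2, \qquad |\lambda| \le g. $$

$(ED_1G)$ For each $r \le r_0$, there exists a constant $\omega(r) \le 1/2$ such that it holds for all
$\theta \in \Theta_{0,G}(r)$

$$ \sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \{\nabla\zeta(\theta) - \nabla\zeta(\theta^*_G)\}}{\omega(r)\,\|V_0\gamma\|} \Bigr\} \le \nu_0^2\lambda^2/2, \qquad |\lambda| \le g. $$

Note that the matrix $V_0$ and constants $g$, $\nu_0$, and $\omega(r)$ can be the same as in the
non-penalized case but can also be improved because of restricting to the smaller set $\Theta_{0,G}(r)$.
To avoid ambiguous notation, we apply the same symbols as in the non-penalized
case.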

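In the same way, the local identifiability condition $(\mathcal{L}_0)$ of Spokoiny (2013a) has to
be restated for $\mathbb{E} L_G(\theta)$ and the local set $\Theta_{0,G}(r)$. Note that the definition (2.1) implies

$$ \nabla \mathbb{E} L_G(\theta^*_G) = \nabla \mathbb{E} L(\theta^*_G) - G^2\theta^*_G = 0. $$

$(\mathcal{L}_0G)$ There are a matrix $D_0^2$ and, for each $r \le r_0$, a constant $\delta(r)$ such that it holds
on the set $\Theta_{0,G}(r)$:

$$ \Bigl| \frac{-\mathbb{E} L(\theta,\theta^*_G) + (\theta - \theta^*_G)^\top G^2\theta^*_G}{\|D_0(\theta - \theta^*_G)\|^2/2} - 1 \Bigr| \le \delta(r). $$

If the expected log-likelihood $\mathbb{E} L(\theta)$ is two times continuously differentiable then
condition $(\mathcal{L}_0G)$ with $D_0^2 = -\nabla^2 \mathbb{E} L(\theta^*_G)$ follows from the second order Taylor expansion:

$$ \bigl| -\mathbb{E} L(\theta,\theta^*_G) + (\theta - \theta^*_G)^\top \nabla \mathbb{E} L(\theta^*_G) - \|D_0(\theta - \theta^*_G)\|^2/2 \bigr| \le \delta(r)\, \|D_0(\theta - \theta^*_G)\|^2/2. $$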
The identifiability condition relates the matrices $D_G^2 = D_0^2 + G^2$ and $V_0^2$.

$(\mathcal{I}G)$ There is a constant $a > 0$ such that $a^2 D_G^2 \ge V_0^2$.

In the non-penalized case of Spokoiny (2013a), this condition reads as $a^2 D_0^2 \ge V_0^2$.
Therefore, the use of regularization helps to improve the identifiability in the regularized
problem relative to the non-penalized one.
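
A small numerical illustration of this remark, under the simplifying assumption (made only for this sketch) that $V_0^2$, $D_0^2$ and $G^2$ are diagonal: the smallest admissible $a^2$ is then the largest coordinate-wise ratio, and adding $G^2$ can only reduce it. The function name and the numbers are hypothetical.

```python
def identifiability_constant(v2, d2, g2=None):
    """Smallest a^2 with a^2 * (d2[j] + g2[j]) >= v2[j] for all j,
    where v2, d2, g2 hold the diagonals of V_0^2, D_0^2 and G^2."""
    if g2 is None:
        g2 = [0.0] * len(v2)          # non-penalized case
    return max(v / (d + g) for v, d, g in zip(v2, d2, g2))


if __name__ == "__main__":
    v2 = [1.0, 1.0, 1.0]
    d2 = [1.0, 0.1, 0.01]             # weakly identified directions
    g2 = [0.0, 1.0, 4.0]              # penalty acting on those directions
    print(identifiability_constant(v2, d2))        # non-penalized a^2
    print(identifiability_constant(v2, d2, g2))    # penalized a^2, never larger
```

In this example the poorly identified third coordinate forces a large $a^2$ without the penalty, whereas with the penalty the constant is determined by the first, unpenalized coordinate.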

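The global conditions are assumed to be fulfilled on the whole parameter set $\Theta$. The
first condition is similar to the local condition $(ED_0G)$ and it requires some exponential
moment of the gradient $\nabla\zeta(\theta)$ for all $\theta \in \Theta$. However, the constant $g$ may depend
on the radius $r = \|V_0(\theta - \theta^*_G)\|$.

$(EG)$ For any $r$, there exists a value $g(r) > 0$ such that for all $\lambda \le g(r)$

$$ \sup_{\theta \in \Theta_{0,G}(r)} \ \sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \nabla\zeta(\theta)}{\|V_0\gamma\|} \Bigr\} \le \nu_0^2\lambda^2/2. $$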



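Define $L_G(\theta,\theta^*_G) = L_G(\theta) - L_G(\theta^*_G)$. The global identification property means that
the deterministic component $\mathbb{E} L_G(\theta,\theta^*_G)$ of the penalized log-likelihood is competitive
with its variance $\operatorname{Var} L_G(\theta,\theta^*_G) = \operatorname{Var} L(\theta,\theta^*_G)$.

$(\mathcal{L}G)$ There exists a matrix $V^2 \ge V_0^2$ and for each $r \ge r_0$ a value $b(r) > 0$ such that
for any $\theta$ with $\|V_0(\theta - \theta^*_G)\| = r$

$$ \frac{-\mathbb{E} L_G(\theta,\theta^*_G)}{\|V(\theta - \theta^*_G)\|^2} \ge b(r). $$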



Spokoiny (2013a), Section 5, explains how the general local and global conditions can 
be checked in some special cases including i.i.d. model, generalized linear models and 
median regression. This can be done in the same way for the case of a penalized MLE. 

3.2 Proof of Theorem 2.3 

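Below we fix $r = r_0$ and set $\Theta_{0,G} = \Theta_{0,G}(r_0)$, $\diamondsuit_{\epsilon,G} = \diamondsuit_{\epsilon,G}(r_0)$, $\diamondsuit_{\underline\epsilon,G} = \diamondsuit_{\underline\epsilon,G}(r_0)$. By
Theorem 2.1, each of them satisfies for a properly selected $x$ on a random set $\Omega(x)$ of
dominating probability

$$ \diamondsuit_{\epsilon,G} \le \varrho\, \mathfrak{z}_0(x,Q) \approx C\varrho\,(p_s + x) \le C\tau_\epsilon\, p_s. \tag{3.1} $$

Now we present the results for the upper and lower excess defined by optimizing
$\mathbb{L}_{\epsilon,G}(\theta,\theta^*_G)$ and $\mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G)$. Let $\xi_{\epsilon,G}$ be defined by

$$ \xi_{\epsilon,G} = D_{\epsilon,G}^{-1} \nabla L_G(\theta^*_G) = D_{\epsilon,G}^{-1} \{ \nabla L(\theta^*_G) - G^2\theta^*_G \}. $$

By definition, $\theta^*_G$ maximizes $\mathbb{E} L_G(\theta)$, yielding $\mathbb{E}\xi_{\epsilon,G} = 0$.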
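Lemma 3.1. It holds

$$ \sup_{\theta \in \Theta_{0,G}} \mathbb{L}_{\epsilon,G}(\theta,\theta^*_G) \le \sup_{\theta \in \mathbb{R}^p} \mathbb{L}_{\epsilon,G}(\theta,\theta^*_G) = \|\xi_{\epsilon,G}\|^2/2. \tag{3.2} $$

Moreover, it holds on the random set $\{ D_{\underline\epsilon,G}^{-1}\xi_{\underline\epsilon,G} \in \Theta_{0,G} \}$ of dominating probability

$$ \sup_{\theta \in \Theta_{0,G}} \mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G) = \sup_{\theta \in \mathbb{R}^p} \mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G) = \|\xi_{\underline\epsilon,G}\|^2/2 \le \|\xi_{\epsilon,G}\|^2/2. \tag{3.3} $$

Proof. The statement (3.2) follows by direct maximization of the quadratic form $\mathbb{L}_{\epsilon,G}(\theta,\theta^*_G)$.
Similarly one obtains (3.3). It remains to evaluate the probability that $D_{\underline\epsilon,G}^{-1}\xi_{\underline\epsilon,G} \in \Theta_{0,G}$.
It holds in view of $D_{\underline\epsilon,G}^2 \le (1 + \tau_\epsilon) D_G^2$ and $V_0^2 \le a^2 D_G^2$,

$$ \mathbb{P}\bigl\{ D_{\underline\epsilon,G}^{-1}\xi_{\underline\epsilon,G} \notin \Theta_{0,G} \bigr\} = \mathbb{P}\bigl\{ \|V_0 D_{\underline\epsilon,G}^{-1}\xi_{\underline\epsilon,G}\| > r_0 \bigr\} $$

$$ = \mathbb{P}\bigl\{ \|V_0 D_{\underline\epsilon,G}^{-2}\nabla\zeta(\theta^*_G)\| > r_0 \bigr\} $$

$$ \le \mathbb{P}\bigl\{ \|V_0 D_G^{-2}\nabla\zeta(\theta^*_G)\|^2 > (1 + \tau_\epsilon)\, r_0^2 \bigr\} $$

$$ \le \mathbb{P}\bigl\{ \|\xi_G\|^2 > a^{-2}(1 + \tau_\epsilon)\, r_0^2 \bigr\}, $$

where $\xi_G = D_G^{-1}\nabla\zeta(\theta^*_G)$. The latter probability is negligible if $r_0^2 \ge C p_s$ is selected
properly, see Theorem 3.2 below. □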

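Further, define the process

$$ \mathbb{L}_G(\theta,\theta^*_G) = (\theta - \theta^*_G)^\top \nabla\zeta(\theta^*_G) - \|D_G(\theta - \theta^*_G)\|^2/2. $$

Similarly to Lemma 3.1, it holds on the random set $\{ \|V_0 D_G^{-2}\nabla\zeta(\theta^*_G)\| \le r_0 \}$ of dominating
probability

$$ \sup_{\theta \in \mathbb{R}^p} \mathbb{L}_G(\theta,\theta^*_G) = \sup_{\theta \in \Theta_{0,G}} \mathbb{L}_G(\theta,\theta^*_G) = \frac{1}{2}\|\xi_G\|^2 \tag{3.4} $$

with $\xi_G = D_G^{-1}\nabla\zeta(\theta^*_G)$. Theorem 2.1 implies

$$ \sup_{\theta \in \Theta_{0,G}} \mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G) - \diamondsuit_{\underline\epsilon,G} \le \sup_{\theta \in \Theta_{0,G}} L_G(\theta,\theta^*_G) \le \sup_{\theta \in \Theta_{0,G}} \mathbb{L}_{\epsilon,G}(\theta,\theta^*_G) + \diamondsuit_{\epsilon,G}. \tag{3.5} $$

By Lemma 3.1, one can replace the sup of $\mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G)$ over $\Theta_{0,G}$ with the sup over the
whole vector space $\mathbb{R}^p$. Putting all the obtained bounds together yields

$$ L_G(\tilde\theta_G,\theta^*_G) \ge \sup_{\theta \in \mathbb{R}^p} \mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G) - \diamondsuit_{\underline\epsilon,G}, \qquad L_G(\tilde\theta_G,\theta^*_G) \le \sup_{\theta \in \mathbb{R}^p} \mathbb{L}_{\epsilon,G}(\theta,\theta^*_G) + \diamondsuit_{\epsilon,G}. \tag{3.6} $$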
e 
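Define

$$ \square_{\epsilon,G} \stackrel{\mathrm{def}}{=} \sup_{\theta} \mathbb{L}_{\epsilon,G}(\theta,\theta^*_G) - \sup_{\theta} \mathbb{L}_G(\theta,\theta^*_G), $$

$$ \square_{\underline\epsilon,G} \stackrel{\mathrm{def}}{=} \sup_{\theta} \mathbb{L}_G(\theta,\theta^*_G) - \sup_{\theta} \mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G). $$

The identities (3.2), (3.3), and (3.4) imply with $\nabla = \nabla\zeta(\theta^*_G)$

$$ 2\square_{\epsilon,G} = \|D_{\epsilon,G}^{-1}\nabla\|^2 - \|D_G^{-1}\nabla\|^2, $$

$$ 2\square_{\underline\epsilon,G} = \|D_G^{-1}\nabla\|^2 - \|D_{\underline\epsilon,G}^{-1}\nabla\|^2. $$

Define

$$ \alpha_{\epsilon,G} \stackrel{\mathrm{def}}{=} \|D_G D_{\epsilon,G}^{-2} D_G - I_p\|_\infty, \qquad \alpha_{\underline\epsilon,G} \stackrel{\mathrm{def}}{=} \|I_p - D_G D_{\underline\epsilon,G}^{-2} D_G\|_\infty. $$

By the regularity condition $(\mathcal{I}G)$, $a^2 D_G^2 \ge V_0^2$, yielding for $D_{\epsilon,G}^2 = D_0^2(1 - \delta) - \varrho V_0^2 + G^2$

$$ D_G^2 (1 - \tau_\epsilon) \le D_{\epsilon,G}^2 \le D_G^2, $$

$$ D_G^2 \le D_{\underline\epsilon,G}^2 \le D_G^2 (1 + \tau_\epsilon), $$

with $\tau_\epsilon = \delta + \varrho a^2$. Thus, the quantities $\alpha_{\epsilon,G}$ and $\alpha_{\underline\epsilon,G}$ satisfy

$$ \alpha_{\epsilon,G} \le \frac{1}{1 - \tau_\epsilon} - 1 = \frac{\tau_\epsilon}{1 - \tau_\epsilon}, \qquad \alpha_{\underline\epsilon,G} \le 1 - \frac{1}{1 + \tau_\epsilon} = \frac{\tau_\epsilon}{1 + \tau_\epsilon}. \tag{3.7} $$

This yields for $\xi_G = D_G^{-1}\nabla$

$$ 2\square_{\epsilon,G} \le \alpha_{\epsilon,G}\|\xi_G\|^2, \qquad 2\square_{\underline\epsilon,G} \le \alpha_{\underline\epsilon,G}\|\xi_G\|^2. $$

Theorem 3.2 below yields that $\|\xi_G\|^2 \le C(p_s + x)$ on a random set $\Omega(x)$ of dominating
probability provided a proper choice of $x$. We conclude from (3.7) that

$$ \square_{\epsilon,G} \le C\tau_\epsilon\, p_s, \qquad \square_{\underline\epsilon,G} \le C\tau_\epsilon\, p_s. \tag{3.8} $$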

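Further, (3.6) and (3.4) yield

$$ L_G(\tilde\theta_G,\theta^*_G) \ge \frac{1}{2}\|\xi_G\|^2 - \square_{\underline\epsilon,G} - \diamondsuit_{\underline\epsilon,G}, $$

$$ L_G(\tilde\theta_G,\theta^*_G) \le \frac{1}{2}\|\xi_G\|^2 + \square_{\epsilon,G} + \diamondsuit_{\epsilon,G}. $$

This yields by (3.1) and (3.8)

$$ \Bigl| L_G(\tilde\theta_G,\theta^*_G) - \frac{1}{2}\|\xi_G\|^2 \Bigr| \le C\tau_\epsilon\, p_s. $$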
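The proof of (2.7) is completed.

Now we derive the expansion for the parameter estimator $\tilde\theta_G$. On the set $C_\epsilon(r)$,
the bracketing bound (3.5) implies

$$ L_G(\tilde\theta_G,\theta^*_G) \ge \sup_{\theta} \mathbb{L}_{\underline\epsilon,G}(\theta,\theta^*_G) - \diamondsuit_{\underline\epsilon,G} = \|D_{\underline\epsilon,G}^{-1}\nabla\|^2/2 - \diamondsuit_{\underline\epsilon,G}. $$

The bracketing bound (3.5) applied at $\theta = \tilde\theta_G$ implies

$$ L_G(\tilde\theta_G,\theta^*_G) \le \mathbb{L}_{\epsilon,G}(\tilde\theta_G,\theta^*_G) + \diamondsuit_{\epsilon,G}. $$

These two bounds together yield by the definition of $\mathbb{L}_{\epsilon,G}(\theta,\theta^*_G)$

$$ (\tilde\theta_G - \theta^*_G)^\top\nabla - \frac{1}{2}\|D_{\epsilon,G}(\tilde\theta_G - \theta^*_G)\|^2 \ge \frac{1}{2}\|D_{\epsilon,G}^{-1}\nabla\|^2 - \diamondsuit_{\epsilon,G} - \diamondsuit_{\underline\epsilon,G} - \square_{\epsilon,G} - \square_{\underline\epsilon,G} $$

and thus

$$ \|D_{\epsilon,G}(\tilde\theta_G - \theta^*_G) - D_{\epsilon,G}^{-1}\nabla\|^2 \le 2\{ \square_{\epsilon,G} + \square_{\underline\epsilon,G} + \diamondsuit_{\epsilon,G} + \diamondsuit_{\underline\epsilon,G} \} \stackrel{\mathrm{def}}{=} 2\Delta_{\epsilon,G}. \tag{3.9} $$

The condition $(\mathcal{I}G)$ implies the inequality $D_G^{-1} D_{\epsilon,G}^2 D_G^{-1} \ge (1 - \tau_\epsilon) I_p$ and hence,

$$ \|D_G D_{\epsilon,G}^{-2} D_G\|_\infty \le (1 - \tau_\epsilon)^{-1}. $$

This and (3.9) provide

$$ \|D_G(\tilde\theta_G - \theta^*_G) - D_G D_{\epsilon,G}^{-2}\nabla\|^2 \le \frac{2\Delta_{\epsilon,G}}{1 - \tau_\epsilon}. $$

Similarly

$$ \|D_G D_{\epsilon,G}^{-2}\nabla - D_G^{-1}\nabla\| = \|(D_G D_{\epsilon,G}^{-2} D_G - I_p) D_G^{-1}\nabla\|. $$

Putting together the last two bounds yields, in view of (3.7),

$$ \|D_G(\tilde\theta_G - \theta^*_G) - D_G^{-1}\nabla\| \le \sqrt{\frac{2\Delta_{\epsilon,G}}{1 - \tau_\epsilon}} + \frac{\tau_\epsilon}{1 - \tau_\epsilon}\|\xi_G\|. $$

This yields the statement (2.8) in view of (3.1) and (3.8).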
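3.3 A deviation bound for the quadratic form $\|\xi_G\|^2$

This section presents a bound for a quadratic form $\|\xi_G\|^2$ where $\xi_G = D_G^{-1}\nabla\zeta(\theta^*_G)$.
The result only uses the condition $(ED_0G)$ which we restate in a slightly different form.
Define

$$ B_G \stackrel{\mathrm{def}}{=} V_0 D_G^{-2} V_0. $$

Let also $\lambda_G \stackrel{\mathrm{def}}{=} \lambda_{\max}(B_G)$ be the largest eigenvalue of $B_G$. If $V_0 = D_G$, then $\lambda_G = 1$.
Below we assume that $\lambda_G \le 1$, otherwise the results have to be restated with $\|\xi_G\|^2/\lambda_G$
in place of $\|\xi_G\|^2$. Also define

$$ p_G \stackrel{\mathrm{def}}{=} \operatorname{tr}(B_G), \qquad v_G^2 \stackrel{\mathrm{def}}{=} 2\operatorname{tr}(B_G^2). $$

Note that $p_G = \mathbb{E}\|\xi_G\|^2$. Moreover, if $\xi_G$ is a Gaussian vector then $v_G^2 = \operatorname{Var}(\|\xi_G\|^2)$.

The condition $(ED_0G)$ implies that the vector $\eta \stackrel{\mathrm{def}}{=} V_0^{-1}\nabla\zeta(\theta^*_G)$ fulfills the following
exponential moment condition:

$$ \log \mathbb{E} \exp(\gamma^\top\eta) \le \|\gamma\|^2/2, \qquad \gamma \in \mathbb{R}^p, \ \|\gamma\| \le g. $$

Here $\nu_0$ is set to one. Spokoiny (2013b) argued how the case of any $\nu_0 \ge 1$ can be
reduced to $\nu_0 = 1$ by a slight change of scale and reducing the value $g$ which is typically
large. For ease of presentation, suppose that $g^2 \ge 2p_G$. The other case only changes the
constants in the inequalities. Note that $\|\xi_G\|^2 = \eta^\top B_G\,\eta$. Define $\mu_c = 2/3$. Define also

$$ y_c^2 \stackrel{\mathrm{def}}{=} g^2/\mu_c^2 - p_G/\mu_c, $$

$$ g_c \stackrel{\mathrm{def}}{=} \mu_c y_c = \sqrt{g^2 - \mu_c\, p_G}, $$

$$ 2x_c \stackrel{\mathrm{def}}{=} \mu_c y_c^2 + \log\det(I_p - \mu_c B_G). \tag{3.10} $$

It is easy to see that $y_c^2 \ge 3g^2/2$ and $g_c \ge \sqrt{2/3}\, g$.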

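Theorem 3.2. Let $(ED_0G)$ hold with $\nu_0 = 1$ and $g^2 \ge 2p_G$. Then for each $x > 0$

$$ \mathbb{P}\bigl( \|\xi_G\|^2 \ge \mathfrak{z}(x,B_G) \bigr) \le 2e^{-x} + 8.4\, e^{-x_c}, $$

where $\mathfrak{z}(x,B_G)$ is defined by

$$ \mathfrak{z}(x,B_G) \stackrel{\mathrm{def}}{=} \begin{cases} p_G + 2 v_G x^{1/2}, & x \le v_G/18, \\ p_G + 6x, & v_G/18 < x \le x_c, \\ |y_c + 2(x - x_c)/g_c|^2, & x > x_c. \end{cases} $$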






WiLKS Theorem for penalized MLE 



Depending on the value $\mathtt{x}$, we observe three types of tail behavior of the quadratic
form $\|\xi_G\|^2$. The sub-Gaussian regime for $\mathtt{x} \le v_G/18$ and the Poissonian regime for
$\mathtt{x} \le \mathtt{x}_c$ are similar to the case of a Gaussian quadratic form. The value $\mathtt{x}_c$ from (3.10) is
of order $\mathtt{g}^2$. In all our results we suppose that $\mathtt{g}^2$ and hence $\mathtt{x}_c$ is sufficiently large, so that
the quadratic form $\|\xi_G\|^2$ can be bounded with a dominating probability by $\lambda_G(p_G + 6\mathtt{x})$
for a proper $\mathtt{x} < \mathtt{x}_c$. In the large deviation zone $\mathtt{x} > \mathtt{x}_c$ we have a special behavior which
is still sufficient to get finite polynomial moments of $\|\xi_G\|^2$. We refer to Spokoiny (2013b)
for the proof of this and related results, further discussion and references.
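
The three regimes of the bound in Theorem 3.2 can be illustrated numerically. The following is a minimal sketch, not the author's code: it assumes the reading $y_c^2 = \mathtt{g}^2/\mu_c^2 - p_G/\mu_c$ and $\mathtt{g}_c = \mu_c y_c$ of the definitions preceding (3.10), and it takes $p_G$, $v_G$ and $\mathtt{x}_c$ as plain numeric inputs. The function names `critical_values` and `quantile_bound` are hypothetical.

```python
import math

def critical_values(g, p, mu_c=2.0 / 3.0):
    """Return (y_c, g_c), assuming y_c^2 = g^2/mu_c^2 - p/mu_c and g_c = mu_c*y_c."""
    y_c = math.sqrt(g * g / mu_c ** 2 - p / mu_c)
    return y_c, mu_c * y_c

def quantile_bound(x, p, v, x_c, y_c, g_c):
    """Piecewise bound z(x, B) with its three regimes."""
    if x <= v / 18.0:
        # sub-Gaussian regime
        return p + 2.0 * v * math.sqrt(x)
    if x <= x_c:
        # Poissonian regime
        return p + 6.0 * x
    # large deviation zone
    return (y_c + 2.0 * (x - x_c) / g_c) ** 2
```

With $\mathtt{g} = 10$ and $p_G = 20$ (so that $\mathtt{g}^2 \ge 2p_G$), the sketch reproduces the two inequalities $y_c^2 \ge 3\mathtt{g}^2/2$ and $\mathtt{g}_c \ge \sqrt{2/3}\,\mathtt{g}$ stated in the text.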

4 Some results for empirical processes 

This chapter presents some general results of the theory of empirical processes. We
assume some exponential moment conditions on the increments of the process, which
allow us to apply the well developed chaining arguments in Orlicz spaces; see e.g. van der
Vaart and Wellner (1996), Chapter 2.2. We, however, follow the more recent approach
inspired by the notions of generic chaining and majorizing measures due to M. Talagrand;
see e.g. Talagrand (1996, 2001, 2005). The results are close to those of Bednorz (2006). We
state the results in a slightly different form and present an independent and self-contained
proof.

The first result states a bound for local fluctuations of the process $\mathcal{U}(\upsilon)$ given on
a metric space $\Upsilon$. Then this result will be used for bounding the maximum of the
negatively drifted process $\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon_0) - \rho\, d^2(\upsilon,\upsilon_0)$ over a vicinity $\Upsilon_\circ(r)$ of the central
point $\upsilon_0$. The behavior of $\mathcal{U}(\upsilon)$ outside of the local central set $\Upsilon_\circ(r)$ is described using
the upper function method. Namely, we construct a multiscale deterministic function
$u(\mu,\upsilon)$ ensuring that with probability at least $1 - e^{-\mathtt{x}}$ it holds $\mu\,\mathcal{U}(\upsilon) + u(\mu,\upsilon) \le \mathfrak{z}(\mathtt{x})$
for all $\upsilon \notin \Upsilon_\circ(r)$ and $\mu \in \mathcal{M}$, where $\mathfrak{z}(\mathtt{x})$ grows linearly in $\mathtt{x}$.

4.1 A bound for local fluctuations 

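An important step in the whole construction is an exponential bound on the maximum
of a random process $\mathcal{U}(\upsilon)$ under the exponential moment conditions on its increments.
Let $d(\upsilon,\upsilon')$ be a semi-distance on $\Upsilon$. We suppose the following condition to hold:

$(\mathcal{E}d)$ There exist $\mathtt{g} > 0$, $r_0 > 0$, $\nu_0 \ge 1$, such that for any $\lambda \le \mathtt{g}$ and $\upsilon,\upsilon' \in \Upsilon$ with
$d(\upsilon,\upsilon') \le r_0$

$$\log \mathbb{E}\exp\Bigl\{\lambda\,\frac{\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon')}{d(\upsilon,\upsilon')}\Bigr\} \le \nu_0^2\lambda^2/2.$$

Formulation of the result involves a sigma-finite measure $\pi$ on the space $\Upsilon$ which
is often called the majorizing measure and is used in the generic chaining device; see
Talagrand (2005). A typical example of choosing $\pi$ is the Lebesgue measure on $\mathbb{R}^p$.
Let $\Upsilon^\circ$ be a subset of $\Upsilon$, and let a sequence $r_k$ be fixed with $r_0 = \mathrm{diam}(\Upsilon^\circ)$ and $r_k = r_0 2^{-k}$.
Let also $\mathcal{B}_k(\upsilon) \stackrel{\mathrm{def}}{=} \{\upsilon' \in \Upsilon^\circ : d(\upsilon,\upsilon') \le r_k\}$ be the $d$-ball centered at $\upsilon$ of radius $r_k$
and $\pi_k(\upsilon)$ denote its $\pi$-measure:

$$\pi_k(\upsilon) \stackrel{\mathrm{def}}{=} \int_{\mathcal{B}_k(\upsilon)} \pi(d\upsilon') = \int_{\Upsilon^\circ} \mathbf{1}\bigl(d(\upsilon,\upsilon') \le r_k\bigr)\,\pi(d\upsilon').$$

Denote also

$$M_k \stackrel{\mathrm{def}}{=} \max_{\upsilon \in \Upsilon^\circ} \frac{\pi(\Upsilon^\circ)}{\pi_k(\upsilon)}, \qquad k \ge 1. \tag{4.1}$$
Finally set $c_1 = 1/3$, $c_k = 2^{-k+2}/3$ for $k \ge 2$, and define the value $Q(\Upsilon^\circ)$ by

$$Q(\Upsilon^\circ) \stackrel{\mathrm{def}}{=} \sum_{k=1}^{\infty} c_k \log(2M_k) = \frac{1}{3}\log(2M_1) + \frac{4}{3}\sum_{k=2}^{\infty} 2^{-k}\log(2M_k).$$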
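
To make the definition of $Q(\Upsilon^\circ)$ concrete, here is a small sketch for a toy situation that is not considered in the paper: a finite set of points on the real line, the distance $d(\upsilon,\upsilon') = |\upsilon - \upsilon'|$, and the counting measure in the role of $\pi$. The helper names `chaining_weights` and `entropy_Q` are hypothetical, and the infinite sum is truncated at `K` terms.

```python
import math

def chaining_weights(K):
    """Weights c_1 = 1/3 and c_k = 2^(-k+2)/3 for k = 2..K."""
    return [1.0 / 3.0] + [2.0 ** (-k + 2) / 3.0 for k in range(2, K + 1)]

def entropy_Q(points, K=30):
    """Truncated Q for a finite subset of the line with the counting measure."""
    n = len(points)
    r0 = max(points) - min(points)  # diameter of the set
    q = 0.0
    for k, c_k in enumerate(chaining_weights(K), start=1):
        r_k = r0 * 2.0 ** (-k)
        # pi_k(v) is the number of points within distance r_k of v;
        # M_k is the largest ratio pi(set) / pi_k(v) over v in the set.
        smallest_ball = min(
            sum(1 for w in points if abs(v - w) <= r_k) for v in points
        )
        M_k = n / smallest_ball
        q += c_k * math.log(2.0 * M_k)
    return q
```

Since every ball contains its own center, $M_k \le n$ in this toy case, and because the weights $c_k$ sum to one the value of $Q$ never exceeds $\log(2n)$.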

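Theorem 4.1. Let $\mathcal{U}$ be a separable process satisfying $(\mathcal{E}d)$. If $\Upsilon^\circ$ is a $d$-ball in
$\Upsilon$ with the center $\upsilon^\circ$ and the radius $r_0$, i.e. $d(\upsilon,\upsilon^\circ) \le r_0$ for all $\upsilon \in \Upsilon^\circ$, then for
$\lambda \le \mathtt{g}_0 \stackrel{\mathrm{def}}{=} \nu_0\mathtt{g}$

$$\log \mathbb{E}\exp\Bigl\{\frac{\lambda}{3\nu_0 r_0}\sup_{\upsilon \in \Upsilon^\circ}\bigl|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)\bigr|\Bigr\} \le \lambda^2/2 + Q(\Upsilon^\circ). \tag{4.2}$$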

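Proof. Replacing $\mathcal{U}(\cdot)$ with $\nu_0^{-1}\mathcal{U}(\cdot)$ and $\mathtt{g}$ with $\mathtt{g}_0 = \nu_0\mathtt{g}$ reduces the
result to the case with $\nu_0 = 1$, which we assume below. Consider for $k \ge 1$ the smoothing
operator $S_k$ defined as

$$S_k f(\upsilon^\circ) = \frac{1}{\pi_k(\upsilon^\circ)}\int_{\mathcal{B}_k(\upsilon^\circ)} f(\upsilon)\,\pi(d\upsilon).$$

Further, define

$$S_0\,\mathcal{U}(\upsilon) \equiv \mathcal{U}(\upsilon^\circ),$$

so that $S_0\,\mathcal{U}$ is a constant function and the same holds for $S_k S_{k-1}\ldots S_0\,\mathcal{U}$ with any
$k \ge 1$. If $f(\cdot) \le g(\cdot)$ for two non-negative functions $f$ and $g$, then $S_k f(\cdot) \le S_k g(\cdot)$.
Separability of the process $\mathcal{U}$ implies that $\lim_k S_k\,\mathcal{U}(\upsilon) = \mathcal{U}(\upsilon)$. We conclude that for
each $\upsilon \in \Upsilon^\circ$

$$\bigl|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)\bigr| = \lim_{k\to\infty}\bigl|S_k\,\mathcal{U}(\upsilon) - S_k\ldots S_0\,\mathcal{U}(\upsilon)\bigr| \le \lim_{k\to\infty}\sum_{i=1}^{k}\bigl|S_k\ldots S_i(I - S_{i-1})\,\mathcal{U}(\upsilon)\bigr| \le \sum_{i=1}^{\infty}\xi_i^*.$$

Here $\xi_k^* \stackrel{\mathrm{def}}{=} \sup_{\upsilon\in\Upsilon^\circ}\xi_k(\upsilon)$ for $k \ge 1$ with

$$\xi_1(\upsilon) \stackrel{\mathrm{def}}{=} \bigl|S_1\,\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)\bigr|, \qquad \xi_k(\upsilon) \stackrel{\mathrm{def}}{=} \bigl|S_k(I - S_{k-1})\,\mathcal{U}(\upsilon)\bigr|, \quad k \ge 2.$$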
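For a fixed point $\upsilon^\sharp$, it holds

$$\xi_k(\upsilon^\sharp) \le \frac{1}{\pi_k(\upsilon^\sharp)}\int_{\mathcal{B}_k(\upsilon^\sharp)}\frac{1}{\pi_{k-1}(\upsilon)}\int_{\mathcal{B}_{k-1}(\upsilon)}\bigl|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon')\bigr|\,\pi(d\upsilon')\,\pi(d\upsilon).$$

For each $\upsilon' \in \mathcal{B}_{k-1}(\upsilon)$, it holds $d(\upsilon,\upsilon') \le r_{k-1} = 2r_k$ and

$$\bigl|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon')\bigr| \le r_{k-1}\,\frac{\bigl|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon')\bigr|}{d(\upsilon,\upsilon')}.$$

This implies for each $\upsilon^\sharp \in \Upsilon^\circ$ and $k \ge 2$ by the Jensen inequality and (4.1)

$$\exp\Bigl\{\frac{\lambda}{r_{k-1}}\xi_k(\upsilon^\sharp)\Bigr\} \le \int_{\mathcal{B}_k(\upsilon^\sharp)}\Bigl(\int_{\mathcal{B}_{k-1}(\upsilon)}\exp\Bigl\{\lambda\frac{|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon')|}{d(\upsilon,\upsilon')}\Bigr\}\frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)}\Bigr)\frac{\pi(d\upsilon)}{\pi_k(\upsilon^\sharp)}$$

$$\le M_k\int_{\Upsilon^\circ}\Bigl(\int_{\mathcal{B}_{k-1}(\upsilon)}\exp\Bigl\{\lambda\frac{|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon')|}{d(\upsilon,\upsilon')}\Bigr\}\frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)}\Bigr)\frac{\pi(d\upsilon)}{\pi(\Upsilon^\circ)}.$$

As the right hand-side does not depend on $\upsilon^\sharp$, this yields for $\xi_k^* = \sup_{\upsilon\in\Upsilon^\circ}\xi_k(\upsilon)$ by
condition $(\mathcal{E}d)$ in view of $e^{|x|} \le e^{x} + e^{-x}$

$$\mathbb{E}\exp\Bigl\{\frac{\lambda}{r_{k-1}}\xi_k^*\Bigr\} \le 2M_k\exp(\lambda^2/2)\int_{\Upsilon^\circ}\Bigl(\int_{\mathcal{B}_{k-1}(\upsilon)}\frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)}\Bigr)\frac{\pi(d\upsilon)}{\pi(\Upsilon^\circ)} = 2M_k\exp(\lambda^2/2).$$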



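Further, the use of $d(\upsilon,\upsilon^\circ) \le r_0$ for all $\upsilon \in \Upsilon^\circ$ yields by $(\mathcal{E}d)$

$$\mathbb{E}\exp\Bigl\{\frac{\lambda}{r_0}\bigl|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)\bigr|\Bigr\} \le 2\exp(\lambda^2/2) \tag{4.3}$$

and thus

$$\mathbb{E}\exp\Bigl\{\frac{\lambda}{r_0}\bigl|S_1\,\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)\bigr|\Bigr\} \le \frac{1}{\pi_1(\upsilon)}\int_{\mathcal{B}_1(\upsilon)}\mathbb{E}\exp\Bigl\{\frac{\lambda}{r_0}\bigl|\mathcal{U}(\upsilon') - \mathcal{U}(\upsilon^\circ)\bigr|\Bigr\}\pi(d\upsilon')$$

$$\le \frac{M_1}{\pi(\Upsilon^\circ)}\int_{\Upsilon^\circ}\mathbb{E}\exp\Bigl\{\frac{\lambda}{r_0}\bigl|\mathcal{U}(\upsilon') - \mathcal{U}(\upsilon^\circ)\bigr|\Bigr\}\pi(d\upsilon').$$

This implies by (4.3) for $\xi_1^* = \sup_{\upsilon\in\Upsilon^\circ}\bigl|S_1\,\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)\bigr|$

$$\mathbb{E}\exp\Bigl\{\frac{\lambda}{r_0}\xi_1^*\Bigr\} \le 2M_1\exp(\lambda^2/2).$$

Denote $c_1 = 1/3$ and $c_k = r_{k-1}/(3r_0) = 2^{-k+2}/3$ for $k \ge 2$. Then $\sum_{k=1}^{\infty} c_k = 1$ and it
holds by the Hölder inequality; see Lemma ?? below: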

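$$\log\mathbb{E}\exp\Bigl(\frac{\lambda}{3r_0}\sum_{k=1}^{\infty}\xi_k^*\Bigr) \le c_1\log\mathbb{E}\exp\Bigl(\frac{\lambda}{r_0}\xi_1^*\Bigr) + \sum_{k=2}^{\infty}c_k\log\mathbb{E}\exp\Bigl(\frac{\lambda}{r_{k-1}}\xi_k^*\Bigr)$$

$$\le \lambda^2/2 + c_1\log(2M_1) + \sum_{k=2}^{\infty}c_k\log(2M_k)$$

$$\le \lambda^2/2 + Q(\Upsilon^\circ).$$

This implies the result. □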

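The exponential bound of Theorem 4.1 can be used for obtaining a probability bound
on the maximum of the increments $\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon')$ over $\Upsilon^\circ$.

Corollary 4.2. Suppose $(\mathcal{E}d)$. If $\Upsilon^\circ$ is a central set with the center $\upsilon_0$ and the radius
$r_0$, then it holds for any $\mathtt{x} > 0$

$$\mathbb{P}\Bigl(\frac{1}{3\nu_0 r_0}\sup_{\upsilon\in\Upsilon^\circ}\mathcal{U}(\upsilon,\upsilon_0) \ge \mathfrak{z}(\mathtt{x},Q)\Bigr) \le \exp(-\mathtt{x}), \tag{4.4}$$
where, with $\mathtt{g}_0 = \nu_0\mathtt{g}$ and $Q = Q(\Upsilon^\circ)$,

$$\mathfrak{z}(\mathtt{x},Q) \stackrel{\mathrm{def}}{=} \begin{cases} \sqrt{2(\mathtt{x}+Q)} & \text{if } \sqrt{2(\mathtt{x}+Q)} \le \mathtt{g}_0, \\ \mathtt{g}_0^{-1}(\mathtt{x}+Q) + \mathtt{g}_0/2 & \text{otherwise.} \end{cases} \tag{4.5}$$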

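Proof. By the Chebyshev inequality, it holds for the r.v. $\xi \stackrel{\mathrm{def}}{=} \sup_{\upsilon\in\Upsilon^\circ}\mathcal{U}(\upsilon,\upsilon_0)/(3\nu_0 r_0)$
for any $\lambda \le \mathtt{g}_0$ by (4.2)

$$\log\mathbb{P}(\xi > \mathfrak{z}) \le -\lambda\mathfrak{z} + \log\mathbb{E}\exp\{\lambda\xi\} \le -\lambda\mathfrak{z} + \lambda^2/2 + Q.$$

Now, given $\mathtt{x} > 0$, we choose $\lambda = \sqrt{2(\mathtt{x}+Q)}$ if this value is not larger than $\mathtt{g}_0$, and
$\lambda = \mathtt{g}_0$ otherwise. It is straightforward to check that $\lambda\mathfrak{z} - \lambda^2/2 - Q \ge \mathtt{x}$ in both cases,
and the choice of $\mathfrak{z}$ by (4.5) yields the bound (4.4). □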
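
The "straightforward check" at the end of the proof can be carried out numerically. The sketch below assumes the two-case form of $\mathfrak{z}(\mathtt{x},Q)$ implied by the choice of $\lambda$ in the proof, namely $\sqrt{2(\mathtt{x}+Q)}$ in the first case and $\mathtt{g}_0^{-1}(\mathtt{x}+Q) + \mathtt{g}_0/2$ otherwise; the function names are hypothetical.

```python
import math

def z_bound(x, Q, g0):
    """Quantile z(x, Q): sub-Gaussian branch if sqrt(2(x+Q)) <= g0, linear otherwise."""
    lam = math.sqrt(2.0 * (x + Q))
    if lam <= g0:
        return lam
    return (x + Q) / g0 + g0 / 2.0

def chebyshev_exponent(x, Q, g0):
    """lambda*z - lambda^2/2 - Q for the lambda selected in the proof."""
    lam = min(math.sqrt(2.0 * (x + Q)), g0)
    z = z_bound(x, Q, g0)
    return lam * z - lam * lam / 2.0 - Q
```

In both branches the exponent equals $\mathtt{x}$ up to rounding, so the bound $\exp(-\mathtt{x})$ in (4.4) is attained by this choice of $\lambda$.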



4.2 A local central bound 

Due to the result of Theorem 4.1, the bound for the maximum of $\mathcal{U}(\upsilon,\upsilon_0)$ over $\upsilon \in
\mathcal{B}_r(\upsilon_0)$ grows quadratically in $r$. So, its applications to situations with $r^2 \gg Q(\Upsilon^\circ)$
are limited. The next result shows that introducing a negative quadratic drift helps
to state a local probability bound that is uniform in $r$. Namely, the bound for the process
$\mathcal{U}(\upsilon,\upsilon_0) - \rho\, d^2(\upsilon,\upsilon_0)/2$ with some positive $\rho$ over a ball $\mathcal{B}_r(\upsilon_0)$ around the point
$\upsilon_0$ only depends on the drift coefficient $\rho$ but not on $r$. Here the generic chaining
arguments are combined with the slicing technique. The idea is for a given $r^* > 1$
to split the ball $\mathcal{B}_{r^*}(\upsilon_0)$ into the slices $\mathcal{B}_{r+1}(\upsilon_0) \setminus \mathcal{B}_r(\upsilon_0)$ and to apply Theorem 4.1 to
each slice separately with a proper choice of the parameter $\lambda$.

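Theorem 4.3. Let $r^*$ be such that $(\mathcal{E}d)$ holds on $\mathcal{B}_{r^*}(\upsilon_0)$. Let also $Q(\Upsilon^\circ) \le Q$ for
$\Upsilon^\circ = \mathcal{B}_r(\upsilon_0)$ with $r \le r^*$. If $\rho > 0$ and $\mathfrak{z}$ are fixed to ensure $\sqrt{2\rho\mathfrak{z}} \le \mathtt{g}_0 = \nu_0\mathtt{g}$ and
$\rho(\mathfrak{z} - 1) \ge 2$, then it holds

$$\log\mathbb{P}\Bigl(\sup_{\upsilon\in\mathcal{B}_{r^*}(\upsilon_0)}\Bigl\{\frac{1}{3\nu_0}\mathcal{U}(\upsilon,\upsilon_0) - \frac{\rho}{2}d^2(\upsilon,\upsilon_0)\Bigr\} > \mathfrak{z}\Bigr) \le -\rho(\mathfrak{z} - 1) + \log(4\mathfrak{z}) + Q. \tag{4.6}$$
Moreover, if $\sqrt{2\rho\mathfrak{z}} > \mathtt{g}_0$, then

$$\log\mathbb{P}\Bigl(\sup_{\upsilon\in\mathcal{B}_{r^*}(\upsilon_0)}\Bigl\{\frac{1}{3\nu_0}\mathcal{U}(\upsilon,\upsilon_0) - \frac{\rho}{2}d^2(\upsilon,\upsilon_0)\Bigr\} > \mathfrak{z}\Bigr) \le -\mathtt{g}_0\sqrt{\rho(\mathfrak{z} - 1)} + \mathtt{g}_0^2/2 + \log(4\mathfrak{z}) + Q. \tag{4.7}$$

Remark 4.1. Formally the bound applies even with $r^* = \infty$ provided that $(\mathcal{E}d)$ is
fulfilled on the whole set $\Upsilon^\circ$.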

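Proof. Denote

$$u(r) \stackrel{\mathrm{def}}{=} \frac{1}{3\nu_0 r}\sup_{\upsilon\in\mathcal{B}_r(\upsilon_0)}\mathcal{U}(\upsilon,\upsilon_0).$$

Then we have to bound the probability

$$\mathbb{P}\Bigl(\sup_{r\le r^*}\bigl\{r\,u(r) - \rho r^2/2\bigr\} > \mathfrak{z}\Bigr).$$

For each $r \le r^*$ and $\lambda \le \mathtt{g}_0$, it follows from (4.2) that

$$\log\mathbb{E}\exp\{\lambda u(r)\} \le \lambda^2/2 + Q.$$

The choice $\lambda = \sqrt{2\rho\mathfrak{z}}$ is admissible in view of $\sqrt{2\rho\mathfrak{z}} \le \mathtt{g}_0$. This implies by the exponential
Chebyshev inequality

$$\log\mathbb{P}\bigl(r\,u(r) - \rho r^2/2 \ge \mathfrak{z}\bigr) \le -\lambda(\mathfrak{z}/r + \rho r/2) + \lambda^2/2 + Q = -\rho\mathfrak{z}\,(x + x^{-1} - 1) + Q, \tag{4.8}$$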



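where $x = \sqrt{\rho/(2\mathfrak{z})}\; r$. We now apply the slicing arguments w.r.t. $t = \rho r^2/2 = \mathfrak{z}x^2$.
By definition, $r\,u(r)$ increases in $r$. We use that for any growing function $f(\cdot)$ and any
$t \ge 0$, it holds

$$f(t) - t \le \int_t^{t+1}\{f(s) - s + 1\}\,ds.$$
Therefore, it holds by (4.8) in view of $dt = 2\mathfrak{z}x\,dx$

$$\mathbb{P}\Bigl(\sup_{r\le r^*}\bigl\{r\,u(r) - \rho r^2/2\bigr\} > \mathfrak{z}\Bigr) \le \int_0^{t^*+1}\mathbb{P}\bigl(r\,u(r) - t > \mathfrak{z} - 1\bigr)\,dt$$

$$\le 2\mathfrak{z}\int_0^{\infty}\exp\bigl\{-\rho(\mathfrak{z} - 1)(x + x^{-1} - 1) + Q\bigr\}\,x\,dx$$

$$\le 2\mathfrak{z}\,e^{-b+Q}\int_0^{\infty}\exp\bigl\{-b(x + x^{-1} - 2)\bigr\}\,x\,dx$$

with $b = \rho(\mathfrak{z} - 1)$ and $t^* = \rho r^{*2}/2$. This implies for $b \ge 2$

$$\mathbb{P}\Bigl(\sup_{r\le r^*}\bigl\{r\,u(r) - \rho r^2/2\bigr\} > \mathfrak{z}\Bigr) \le 2\mathfrak{z}\,e^{-b+Q}\int_0^{\infty}\exp\bigl\{-2(x + x^{-1} - 2)\bigr\}\,x\,dx$$

$$\le 4\mathfrak{z}\exp\bigl\{-\rho(\mathfrak{z} - 1) + Q\bigr\}$$

and (4.6) follows.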
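
The last step of this argument uses that $\int_0^\infty \exp\{-2(x + x^{-1} - 2)\}\,x\,dx \le 2$, which turns the factor $2\mathfrak{z}$ into $4\mathfrak{z}$. The following sketch checks this numerically with a plain midpoint rule; the function name, the truncation point `upper` and the grid size `n` are illustrative choices, not part of the paper.

```python
import math

def slicing_integral(b=2.0, upper=40.0, n=200_000):
    """Midpoint-rule approximation of the integral of x*exp(-b(x + 1/x - 2)) over (0, upper)."""
    h = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        # The integrand vanishes rapidly both near 0 and for large x.
        total += x * math.exp(-b * (x + 1.0 / x - 2.0))
    return total * h
```

Because $x + x^{-1} - 2 \ge 0$, the integrand decreases in $b$, so the value at $b = 2$ dominates the integral for every $b \ge 2$.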

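If $\sqrt{2\rho\mathfrak{z}} > \mathtt{g}_0$, then select $\lambda = \mathtt{g}_0$. For $r \le r^*$

$$\log\mathbb{P}\bigl\{r\,u(r) - \rho r^2/2 \ge \mathfrak{z}\bigr\} = \log\mathbb{P}\bigl\{u(r) \ge \mathfrak{z}/r + \rho r/2\bigr\}$$

$$\le -\lambda(\mathfrak{z}/r + \rho r/2) + \lambda^2/2 + Q$$

$$\le -\lambda\sqrt{\rho\mathfrak{z}}\,(x + x^{-1} - 2)/2 - \lambda\sqrt{\rho\mathfrak{z}} + \lambda^2/2 + Q,$$

where $x = \sqrt{\rho/\mathfrak{z}}\; r$. This allows us to bound in the same way as above

$$\mathbb{P}\Bigl(\sup_{r\le r^*}\bigl\{r\,u(r) - \rho r^2/2\bigr\} \ge \mathfrak{z}\Bigr) \le 4\mathfrak{z}\exp\bigl(-\lambda\sqrt{\rho(\mathfrak{z} - 1)} + \lambda^2/2 + Q\bigr),$$
yielding (4.7). □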

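This result can be used for describing the concentration bound for the maximum of
$(3\nu_0)^{-1}\mathcal{U}(\upsilon,\upsilon_0) - \rho\, d^2(\upsilon,\upsilon_0)/2$. Namely, it suffices to find $\mathfrak{z}$ ensuring the prescribed
deviation probability. We state the result for a special case with $\rho = 1$ and $\mathtt{g}_0 \ge 3$,
which simplifies the notation.

Corollary 4.4. Under the conditions of Theorem 4.3, for any $\mathtt{x} \ge 0$ with $\mathtt{x} + Q \ge 4$

$$\mathbb{P}\Bigl(\sup_{\upsilon\in\mathcal{B}_{r^*}(\upsilon_0)}\Bigl\{\frac{1}{3\nu_0}\mathcal{U}(\upsilon,\upsilon_0) - \frac{1}{2}d^2(\upsilon,\upsilon_0)\Bigr\} > \mathfrak{z}_0(\mathtt{x},Q)\Bigr) \le \exp(-\mathtt{x}),$$
where, with $\mathtt{g}_0 = \nu_0\mathtt{g} \ge 2$,

$$\mathfrak{z}_0(\mathtt{x},Q) \stackrel{\mathrm{def}}{=} \begin{cases} \bigl(1 + \sqrt{\mathtt{x}+Q}\bigr)^2 & \text{if } 1 + \sqrt{\mathtt{x}+Q} \le \mathtt{g}_0, \\ 1 + \bigl\{2\mathtt{g}_0^{-1}(\mathtt{x}+Q) + \mathtt{g}_0\bigr\}^2 & \text{otherwise.} \end{cases} \tag{4.9}$$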

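Proof. First consider the case $1 + \sqrt{\mathtt{x}+Q} \le \mathtt{g}_0$. In view of (4.6), it suffices to check
that $\mathfrak{z} = \bigl(1 + \sqrt{\mathtt{x}+Q}\bigr)^2$ ensures

$$\mathfrak{z} - 1 - \log(4\mathfrak{z}) - Q \ge \mathtt{x}.$$

This follows from the inequality

$$(1 + y)^2 - 1 - 2\log(2 + 2y) \ge y^2$$

with $y = \sqrt{\mathtt{x}+Q} \ge 2$.

If $1 + \sqrt{\mathtt{x}+Q} > \mathtt{g}_0$, define $\mathfrak{z} = 1 + y^2$ with $y = 2\mathtt{g}_0^{-1}(\mathtt{x}+Q) + \mathtt{g}_0$. Then

$$\mathtt{g}_0\sqrt{\mathfrak{z} - 1} - \log(4\mathfrak{z}) - \mathtt{g}_0^2/2 - Q - \mathtt{x} = \mathtt{g}_0 y/2 - \log\{4(1 + y^2)\} \ge 0$$
because $\mathtt{g}_0 \ge 3$ and $3y/2 - \log(1 + y^2) \ge \log(4)$ for $y \ge 2$. □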
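
The two cases of the proof can be checked on concrete numbers. The sketch below assumes the two-case form of $\mathfrak{z}_0(\mathtt{x},Q)$ used in the proof and evaluates the right-hand sides of (4.6) and (4.7) with $\rho = 1$, namely $-(\mathfrak{z}-1) + \log(4\mathfrak{z}) + Q$ and $-\mathtt{g}_0\sqrt{\mathfrak{z}-1} + \mathtt{g}_0^2/2 + \log(4\mathfrak{z}) + Q$; the function names are hypothetical.

```python
import math

def z0(x, Q, g0):
    """Quantile z_0(x, Q) in its two-case form, with rho = 1."""
    s = math.sqrt(x + Q)
    if 1.0 + s <= g0:
        return (1.0 + s) ** 2
    y = 2.0 * (x + Q) / g0 + g0
    return 1.0 + y * y

def log_tail(x, Q, g0):
    """Log-probability bound from (4.6) or (4.7), evaluated at z = z0(x, Q)."""
    z = z0(x, Q, g0)
    if 1.0 + math.sqrt(x + Q) <= g0:
        return -(z - 1.0) + math.log(4.0 * z) + Q
    return -g0 * math.sqrt(z - 1.0) + g0 * g0 / 2.0 + math.log(4.0 * z) + Q
```

For admissible inputs ($\mathtt{x} + Q \ge 4$, $\mathtt{g}_0 \ge 3$) the value of `log_tail` stays below $-\mathtt{x}$, which is the claim of Corollary 4.4.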

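If $\mathtt{g} \gg \sqrt{Q}$ and $\mathtt{x}$ is not too big, then $\mathfrak{z}_0(\mathtt{x},Q)$ is of order $\mathtt{x} + Q$. So, the main
message of this result is that with a high probability the maximum of $(3\nu_0)^{-1}\mathcal{U}(\upsilon,\upsilon_0) -
d^2(\upsilon,\upsilon_0)/2$ does not significantly exceed the level $Q$.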

4.3 Finite-dimensional smooth case 



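Here we discuss the special case when $\Upsilon$ is an open subset in $\mathbb{R}^p$, the stochastic
process $\mathcal{U}(\upsilon)$ is absolutely continuous and its gradient $\nabla\mathcal{U}(\upsilon) \stackrel{\mathrm{def}}{=} d\,\mathcal{U}(\upsilon)/d\upsilon$ has bounded
exponential moments.

$(\mathcal{E}D)$ There exist $\mathtt{g} > 0$, $\nu_0 \ge 1$, and for each $\upsilon \in \Upsilon$, a symmetric non-negative
matrix $H(\upsilon)$ such that for any $\lambda \le \mathtt{g}$ and any unit vector $\gamma \in \mathbb{R}^p$, it holds

$$\log\mathbb{E}\exp\Bigl\{\lambda\,\frac{\gamma^\top\nabla\mathcal{U}(\upsilon)}{\|H(\upsilon)\gamma\|}\Bigr\} \le \frac{\nu_0^2\lambda^2}{2}.$$

A natural candidate for $H^2(\upsilon)$ is the covariance matrix $\mathrm{Var}(\nabla\mathcal{U}(\upsilon))$ provided that
this matrix is well posed. Then the constant $\nu_0$ can be taken close to one by reducing
the value $\mathtt{g}$.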

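In what follows we fix a subset $\Upsilon^\circ$ of $\Upsilon$ and establish a bound for the maximum of the
process $\mathcal{U}(\upsilon,\upsilon^\circ) = \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)$ on $\Upsilon^\circ$ for a fixed point $\upsilon^\circ$. We will assume existence
of a dominating matrix $H^* = H^*(\Upsilon^\circ)$ such that $H(\upsilon) \preceq H^*$ for all $\upsilon \in \Upsilon^\circ$. We also
assume that $\pi$ is the Lebesgue measure on $\Upsilon$. First we show that the differentiability
condition $(\mathcal{E}D)$ implies $(\mathcal{E}d)$.

Lemma 4.5. Assume that $(\mathcal{E}D)$ holds with some $\mathtt{g}$ and $H(\upsilon) \preceq H^*$ for $\upsilon \in \Upsilon^\circ$.
Consider any $\upsilon,\upsilon^\circ \in \Upsilon^\circ$. Then it holds for $|\lambda| \le \mathtt{g}$

$$\log\mathbb{E}\exp\Bigl\{\lambda\,\frac{\mathcal{U}(\upsilon,\upsilon^\circ)}{\|H^*(\upsilon - \upsilon^\circ)\|}\Bigr\} \le \frac{\nu_0^2\lambda^2}{2}.$$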

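Proof. Denote $\delta = \|\upsilon - \upsilon^\circ\|$, $\gamma = (\upsilon - \upsilon^\circ)/\delta$. Then

$$\mathcal{U}(\upsilon,\upsilon^\circ) = \delta\,\gamma^\top\int_0^1\nabla\mathcal{U}(\upsilon^\circ + t\delta\gamma)\,dt$$

and $\|H^*(\upsilon - \upsilon^\circ)\| = \delta\|H^*\gamma\|$. Now the Hölder inequality and $(\mathcal{E}D)$ yield

$$\mathbb{E}\exp\Bigl\{\lambda\,\frac{\mathcal{U}(\upsilon,\upsilon^\circ)}{\|H^*(\upsilon - \upsilon^\circ)\|} - \frac{\nu_0^2\lambda^2}{2}\Bigr\} \le \int_0^1\mathbb{E}\exp\Bigl\{\lambda\,\frac{\gamma^\top\nabla\mathcal{U}(\upsilon^\circ + t\delta\gamma)}{\|H^*\gamma\|} - \frac{\nu_0^2\lambda^2}{2}\Bigr\}\,dt \le 1$$

as required. □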

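The result of Lemma 4.5 enables us to define $d(\upsilon,\upsilon') = \|H^*(\upsilon - \upsilon')\|$ so that the
corresponding ball coincides with the ellipsoid $B(r,\upsilon^\circ)$. Now we bound the value $Q(\Upsilon^\circ)$
for $\Upsilon^\circ = B(r_0,\upsilon^\circ)$.

Lemma 4.6. Let $\Upsilon^\circ = B(r_0,\upsilon^\circ)$. Under the conditions of Lemma 4.5, it holds $Q(\Upsilon^\circ) \le
\mathfrak{c}_1 p$, where $\mathfrak{c}_1 = 2$ for $p \ge 2$, and $\mathfrak{c}_1 = 2.7$ for $p = 1$.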

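Proof. The set $\Upsilon^\circ$ coincides with the ellipsoid $B(r_0,\upsilon^\circ)$ while the $d$-ball $\mathcal{B}_k(\upsilon)$ coincides
with the ellipsoid $B(r_k,\upsilon)$ for each $k \ge 2$. By change of variables, the study can
be reduced to the case with $\upsilon^\circ = 0$, $H^* \equiv I_p$, $r_0 = 1$, so that $B(r,\upsilon)$ is the usual
Euclidean ball in $\mathbb{R}^p$ of radius $r$. It is obvious that the measure of the overlap of two
balls $B(1,0)$ and $B(2^{-k+1},\upsilon)$ for $\|\upsilon\| \le 1$ is minimized when $\|\upsilon\| = 1$, and this value
is the same for all such $\upsilon$.

Now we use the following observation. Fix $\upsilon^\sharp$ with $\|\upsilon^\sharp\| = 1$. Let $r \le 1$, $\upsilon^\flat =
(1 - r^2/2)\upsilon^\sharp$ and $r^\flat = r - r^2/2$. If $\upsilon \in B(r^\flat,\upsilon^\flat)$, then $\upsilon \in B(r,\upsilon^\sharp)$ because

$$\|\upsilon^\sharp - \upsilon\| \le \|\upsilon^\sharp - \upsilon^\flat\| + \|\upsilon - \upsilon^\flat\| \le r^2/2 + r - r^2/2 \le r.$$
Moreover, for each $\upsilon \in B(r^\flat,\upsilon^\flat)$, it holds with $u = \upsilon - \upsilon^\flat$

$$\|\upsilon\|^2 = \|\upsilon^\flat\|^2 + \|u\|^2 + 2u^\top\upsilon^\flat \le (1 - r^2/2)^2 + |r^\flat|^2 + 2u^\top\upsilon^\flat \le 1 + 2u^\top\upsilon^\flat.$$

This means that either $\upsilon = \upsilon^\flat + u$ or $\upsilon^\flat - u$ belongs to the ball $B(r_0,\upsilon^\circ)$ and thus,
$\pi\bigl(B(1,0)\cap B(r,\upsilon^\sharp)\bigr) \ge \pi\bigl(B(r^\flat,\upsilon^\flat)\bigr)/2$. We conclude that

$$\frac{\pi(B(1,0))}{\pi\bigl(B(1,0)\cap B(r,\upsilon^\sharp)\bigr)} \le \frac{2\pi(B(1,0))}{\pi(B(r^\flat,0))} = 2(r - r^2/2)^{-p}.$$

This implies for $k \ge 1$ and $r_k = 2^{-k+1}$ that $2M_{k+1} \le 2^{2+kp}(1 - 2^{-k-1})^{-p}$. The quantity
$Q(\Upsilon^\circ)$ can now be evaluated as

$$Q(\Upsilon^\circ) \le \frac{1}{3}\log(2^{2+p}) + \frac{2}{3}\sum_{k=1}^{\infty}2^{-k}\log(2^{2+kp}) - \frac{2p}{3}\sum_{k=1}^{\infty}2^{-k}\log(1 - 2^{-k-1})$$

$$= \frac{\log 2}{3}\Bigl[2 + p + 2\sum_{k=1}^{\infty}(2 + kp)2^{-k}\Bigr] - \frac{2p}{3}\sum_{k=1}^{\infty}2^{-k}\log(1 - 2^{-k-1}) \le \mathfrak{c}_1 p,$$

where $\mathfrak{c}_1 = 2$ for $p \ge 2$, and $\mathfrak{c}_1 = 2.7$ for $p = 1$, and the result follows. □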
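
The constants in Lemma 4.6 can be verified by evaluating the series in the last display. The sketch below assumes the coefficients $\frac{\log 2}{3}$ and $\frac{2p}{3}$ that follow from $c_1 = 1/3$, $c_k = 2^{-k+2}/3$ and the bound on $2M_{k+1}$; the function name `entropy_bound` and the truncation level `K` are illustrative.

```python
import math

def entropy_bound(p, K=200):
    """Evaluate (log 2 / 3)[2 + p + 2*sum (2 + k p) 2^-k] - (2p/3) sum 2^-k log(1 - 2^(-k-1))."""
    s1 = sum((2 + k * p) * 2.0 ** (-k) for k in range(1, K + 1))
    s2 = sum(2.0 ** (-k) * math.log(1.0 - 2.0 ** (-k - 1)) for k in range(1, K + 1))
    return math.log(2.0) / 3.0 * (2 + p + 2.0 * s1) - 2.0 * p / 3.0 * s2
```

Under these assumptions the bound is affine in $p$, with slope below 2, which is why a single constant $\mathfrak{c}_1 = 2$ works for all $p \ge 2$ while $p = 1$ needs the slightly larger value 2.7.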

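Now we specify the local bounds of Theorem 4.1 and the central result of Corollary 4.4
to the smooth case.

Theorem 4.7. Suppose $(\mathcal{E}d)$. For any $\lambda \le \nu_0\mathtt{g}$, $r_0 > 0$, and $\upsilon^\circ \in \Upsilon$

$$\log\mathbb{E}\exp\Bigl\{\frac{\lambda}{3\nu_0 r_0}\sup_{\upsilon\in B(r_0,\upsilon^\circ)}\bigl|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)\bigr|\Bigr\} \le \lambda^2/2 + Q,$$

where $Q = \mathfrak{c}_1 p$.

We consider the local sets of the elliptic form $\Upsilon_\circ(r) \stackrel{\mathrm{def}}{=} \{\upsilon : \|H_0(\upsilon - \upsilon_0)\| \le r\}$,
where $H_0$ dominates $H(\upsilon)$ on this set: $H(\upsilon) \preceq H_0$.

Theorem 4.8. Let $(\mathcal{E}D)$ hold with some $\mathtt{g}$ and a matrix $H(\upsilon)$. Suppose that $H(\upsilon) \preceq
H_0$ for all $\upsilon \in \Upsilon_\circ(r)$. Then

$$\mathbb{P}\Bigl(\sup_{\upsilon\in\Upsilon_\circ(r)}\Bigl\{\frac{1}{3\nu_0}\mathcal{U}(\upsilon,\upsilon_0) - \frac{1}{2}\|H_0(\upsilon - \upsilon_0)\|^2\Bigr\} \ge \mathfrak{z}_0(\mathtt{x},p)\Bigr) \le \exp(-\mathtt{x}), \tag{4.10}$$
where $\mathfrak{z}_0(\mathtt{x},p)$ coincides with $\mathfrak{z}_0(\mathtt{x},Q)$ from (4.9) for $Q = \mathfrak{c}_1 p$.

Remark 4.2. An important feature of the established result is that the bound in the
right hand-side of (4.10) does not depend on the value $r$ describing the radius of the
local vicinity around the central point $\upsilon_0$. In the ideal case one would apply this result
with $r = \infty$ provided that the condition $H(\upsilon) \preceq H_0$ is fulfilled uniformly over $\Upsilon$.

Proof. Lemma 4.6 implies $(\mathcal{E}d)$ with $d(\upsilon,\upsilon_0) = \|H_0(\upsilon - \upsilon_0)\|^2/2$. Now the result
follows from Corollary 4.4. □

4.4 Roughness constraints for dimension reduction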

The local bounds of Theorems 4.1 and 4.3 can be extended in several directions. Here
we briefly discuss one extension related to the use of a smoothness condition on the
parameter $\upsilon$. Let $t(\upsilon)$ be a non-negative penalty function on $\Upsilon$. A particular
example of such a penalty function is the roughness penalty $t(\upsilon) = \|G\upsilon\|^2$ for a given
matrix $G$ on $\mathbb{R}^p$. Let $t_0 \ge 1$ be fixed. Redefine the sets $\mathcal{B}_r(\upsilon^\circ)$ by the constraint $t(\upsilon) \le t_0$:

$$\mathcal{B}_r(\upsilon^\circ) = \{\upsilon \in \Upsilon : d(\upsilon,\upsilon^\circ) \le r;\ t(\upsilon) \le t_0\},$$

and consider $\Upsilon^\circ = \mathcal{B}_{r_0}(\upsilon^\circ)$ for a fixed central point $\upsilon^\circ$ and the radius $r_0$. One can
easily check that the results of Theorems 4.1 and 4.3 and their corollaries extend to this
situation without any change. The only difference is in the definition of the values $Q(\Upsilon^\circ)$
and $Q$. Each value $Q(\Upsilon^\circ)$ is defined via the quantities $\pi_k(\upsilon) = \pi(\mathcal{B}_{r_k}(\upsilon))$, which
obviously change when each ball $\mathcal{B}_r(\upsilon)$ is redefined. Examples below show that the use
of the penalization can substantially reduce the value $Q$.

Now we specify the results to the case of a smooth process $\mathcal{U}$ given on a local
ball $\Upsilon^\circ = \mathcal{B}_{r_0}(\upsilon^\circ)$ defined by the condition $\{\|H_0(\upsilon - \upsilon^\circ)\| \le r_0\}$ and a smoothness
constraint $\|G\upsilon\|^2 \le t_0 = r_0^2$. The local sets $\mathcal{B}_r(\upsilon)$ are of the form:

$$\mathcal{B}_r(\upsilon) = \{\upsilon' : \|H_0(\upsilon - \upsilon')\| \le r,\ \|G\upsilon'\| \le r_0\}. \tag{4.11}$$

The effective dimension $p_e = p_e(S)$ can be defined as the dimension of the subspace
on which $H_0 \ge G$. The formal definition uses the spectral decomposition of the matrix
$S = H_0^{-1}G^2H_0^{-1}$. Let $g_1^2 \le g_2^2 \le \dots \le g_p^2$ be the eigenvalues of $S$. Define $p_e(S)$ as the
largest index $j$ for which $g_j \le 1$:

$$p_e(S) \stackrel{\mathrm{def}}{=} \max\{j \ge 1 : g_j \le 1\}. \tag{4.12}$$

In the non-penalized case, the entropy term $Q$ is proportional to the dimension $p$.
The roughness penalty makes it possible to reduce $p$ to the effective dimension $p_e(S)$, which can
be much smaller than $p$ depending on the relation between the matrices $H_0$ and $G$. More
precisely, if the eigenvalues $g_j$ grow sufficiently fast, the entropy calculus effectively
reduces to the coordinates with $g_j \le 1$.

Lemma 4.9. Let $g_1 = 0$. For each $r_0 \ge 1$, it holds

$$Q(\Upsilon_\circ(r_0)) \le C\,p_s, \tag{4.13}$$

with $p_s = p_s(S)$ defined by

$$p_s(S) \stackrel{\mathrm{def}}{=} p_e(S) + \sum_{j=1}^{p} g_j^{-1}\log_+(g_j). \tag{4.14}$$
If the sum $\sum_{j=1}^{p} g_j^{-1}\log_+(g_j)$ is bounded by a fixed constant, then the value $p_s$ is
close to the effective dimension $p_e(S)$ from (4.12).
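
A short numerical illustration may help to see how small the effective dimension can be. The sketch below takes the values $g_j$ as given and assumes the form $p_s = p_e + \sum_j g_j^{-1}\log_+(g_j)$ for (4.14), without any additional constant factor; the function names and the quadratic growth $g_j = (j/5)^2$ in the usage example are hypothetical choices made only for illustration.

```python
import math

def effective_dimension(g):
    """p_e: number of values g_j that do not exceed one."""
    return sum(1 for gj in g if gj <= 1.0)

def smoothed_dimension(g):
    """p_s = p_e + sum_j log_+(g_j)/g_j, following the assumed form of (4.14)."""
    # log_+(g_j) vanishes for g_j <= 1, so only g_j > 1 contribute to the sum.
    tail = sum(math.log(gj) / gj for gj in g if gj > 1.0)
    return effective_dimension(g) + tail

# Hypothetical spectrum with quadratic growth and full dimension p = 1000.
g_values = [(j / 5.0) ** 2 for j in range(1, 1001)]
p_e = effective_dimension(g_values)
p_s = smoothed_dimension(g_values)
```

In this example only five of the thousand coordinates have $g_j \le 1$, and the correction sum stays bounded, so $p_s$ remains far below the full dimension $p$.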

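Proof. We follow the proof of Lemma 4.6. By a change of variables one can reduce the
problem to the case when $H_0$ is the identity matrix and $r_0 = 1$. Moreover, one can
easily see that $\upsilon^\circ = 0$ is the hardest case. The case of $\upsilon^\circ \ne 0$ can be considered
similarly. By a further change the matrix $S = G^2$ can be represented in diagonal form:
$S = \mathrm{diag}\{g_1^2 \le \dots \le g_p^2\}$. Let $\upsilon^\sharp$ be any point with $\|\upsilon^\sharp\| \le 1$ and $\|G\upsilon^\sharp\| \le 1$, and
$r \le 1$. Simple arguments show that the measure of the set $\mathcal{B}_r(\upsilon^\sharp)$ over all such vectors
is minimized at $\upsilon^\sharp = (1,0,\dots,0)^\top$. Define $r^\flat = r - r^2/2$, $\upsilon^\flat = (1 - r^2/2)\upsilon^\sharp$. Fix
any $\upsilon$ such that $u = \upsilon - \upsilon^\flat$ fulfills $\|u\| \le r^\flat$, $\|Gu\| \le r$, and $u_1 \le 0$. As $G\upsilon^\flat = 0$, it
holds

$$\|G\upsilon\| = \|Gu\| \le r \le 1,$$
$$\|\upsilon - \upsilon^\sharp\| \le \|\upsilon - \upsilon^\flat\| + \|\upsilon^\flat - \upsilon^\sharp\| \le r^\flat + r^2/2 = r,$$

$$\|\upsilon\|^2 = \|\upsilon^\flat + u\|^2 = \|\upsilon^\flat\|^2 + \|u\|^2 + 2u^\top\upsilon^\flat \le (1 - r^2/2)^2 + |r^\flat|^2 \le 1.$$

This yields that $\pi\bigl(\mathcal{B}_1(0)\cap\mathcal{B}_r(\upsilon^\sharp)\bigr) \ge \pi\bigl(\mathcal{B}_{r^\flat}(0)\bigr)/2$. Moreover, let the index $p_e(r)$ be
defined as the largest $j$ with $r g_j \le 1$. Consider any $\upsilon \in \mathcal{B}_1(0)$ and construct another
point $\upsilon(r)$ by multiplying with $r$ every element $\upsilon_j$ for $j \le p_e(r)$. The construction
ensures that $\upsilon(r) \in \mathcal{B}_r(0)$. This implies

$$\frac{\pi(\mathcal{B}_1(0))}{\pi\bigl(\mathcal{B}_1(0)\cap\mathcal{B}_r(\upsilon^\sharp)\bigr)} \le \frac{2\pi(\mathcal{B}_1(0))}{\pi(\mathcal{B}_{r^\flat}(0))}.$$

Application of this bound for $k \ge 1$, $r_{k+1} = 2^{-k}$, and $p_k = p_e(r_{k+1})$ yields that
$2M_{k+1} \le 2^{2+kp_k}(1 - 2^{-k-1})^{-p_k}$. The quantity $Q(\Upsilon^\circ)$ can now be evaluated as

$$Q(\Upsilon^\circ) \le \frac{1}{3}\log(2^{2+p_e}) + \frac{2}{3}\sum_{k=1}^{\infty}2^{-k}\log(2^{2+kp_k}) - \frac{2}{3}\sum_{k=1}^{\infty}2^{-k}p_k\log(1 - 2^{-k-1}).$$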

Further, for each g > 1 ,\t holds with k{g) = max{A; : g < 2^^} 

oo 

sig) ''^^f J]2-'=+iA;ll(2-'=5 < l) < 2k{g)2~''(s'^ < 2k{g)/g < 2g~Hog^{2g). 
k=i 

Thus, 

oo oo p p 

Y.2-^kpk < j;2-'^A:^]l(2-% < 1) < 2 

k=l k=l j=l j=l 

This easily implies the result (4.13); cf. the proof of Lemma 4.6. □

The first result adjusts Theorem 4.7 to the penalized case. The maximum of the
process $\mathcal{U}$ is taken over a ball $\mathcal{B}_r(\upsilon)$ from (4.11), which is smaller than the similar ball
in the non-penalized case. This explains the gain in the entropy term $Q$.

Theorem 4.10. Let $(\mathcal{E}D)$ hold with some $\mathtt{g}$ and a matrix $H(\upsilon) \preceq H_0$ for all $\upsilon \in \Upsilon$.
Then for any $\lambda \le \nu_0\mathtt{g}$, $r_0 \ge 1$, and $\upsilon^\circ \in \Upsilon$

$$\log\mathbb{E}\exp\Bigl\{\frac{\lambda}{3\nu_0 r_0}\sup_{\upsilon\in\mathcal{B}_{r_0}(\upsilon^\circ)}\bigl|\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)\bigr|\Bigr\} \le \lambda^2/2 + Q,$$

where $\mathcal{B}_{r_0}(\upsilon^\circ)$ is given by (4.11), $Q = C\,p_s$, and $p_s$ is the effective dimension from
(4.14).

Proof. The result follows from Corollary 4.4. It is only required to evaluate the local
entropy $Q(\Upsilon_\circ(r_0))$. This is done in Lemma 4.9. □

The magnitude of the process $\mathcal{U}$ over $\mathcal{B}_{r_0}(\upsilon^\circ)$ is of order $r_0$ and it grows with $r_0$.
The use of the negative drift allows us to establish a unified result.

Theorem 4.11. Let $r_0 \ge 1$ be fixed and let $(\mathcal{E}D)$ hold with some $\mathtt{g}$ and a matrix
$H(\upsilon) \preceq H_0$ for all $\upsilon \in \mathcal{B}_{r_0}(\upsilon_0)$. Then

$$\mathbb{P}\Bigl(\sup_{\upsilon\in\mathcal{B}_{r_0}(\upsilon_0)}\Bigl\{\frac{1}{3\nu_0}\mathcal{U}(\upsilon,\upsilon_0) - \frac{1}{2}\|H_0(\upsilon - \upsilon_0)\|^2\Bigr\} \ge \mathfrak{z}(\mathtt{x},Q)\Bigr) \le \exp(-\mathtt{x}),$$

where $\mathfrak{z}(\mathtt{x},Q)$ is given by (4.9) with $Q = C\,p_s$.